Decision Theoretic Subsampling for Induction on Large Databases
Ron Musick
Computer Science Division
University of California
Berkeley, California 94720
musick@cs.berkeley.edu
and
Jason Catlett
Room 6H-524, AT&T Bell Laboratories
600 Mountain Avenue (P.O. Box 636)
Murray Hill, NY 07974-0636, USA
catlett@ulysses.att.com
and
Stuart Russell
Computer Science Division
University of California
Berkeley, California 94720
russell@cs.berkeley.edu
Abstract
Standard methods of constructing decision trees can be
prohibitively expensive when induction algorithms are given
very large training sets on which to compare attributes. This
expense can often be avoided. By using a subsample for
this calculation, we can get an approximation to the information
gain used to assess each attribute. Selecting an attribute on
this basis involves a certain risk, which depends on the
subsample size and the information gain values observed.
This paper addresses the questions of assessing when the choice may be
made with a given expected error, and determining a sampling strategy
that minimizes the computation cost of making it. The theory is
entirely analogous to that used for decision-theoretic control of
search. An empirical evaluation shows that an implementation performs
well over an appropriately wide range of inputs on two fundamental
tasks: deciding whether a choice of attribute can be safely made on a
given sample, and if not, estimating how large a subsample is likely
to be required to do so.
Appeared in
ML 1993