Decision Theoretic Subsampling for Induction on Large Databases

Ron Musick
Computer Science Division
University of California
Berkeley, California 94720
musick@cs.berkeley.edu

and

Jason Catlett
Room 6H-524, AT&T Bell Laboratories
600 Mountain Avenue (P.O. Box 636)
Murray Hill, NY 07974-0636, USA
catlett@ulysses.att.com

and

Stuart Russell
Computer Science Division
University of California
Berkeley, California 94720
russell@cs.berkeley.edu

Abstract

Standard methods of constructing decision trees can be prohibitively expensive when induction algorithms are given very large training sets on which to compare attributes. This expense can often be avoided. By using a subsample for this calculation, we can get an approximation to the information gain used to assess each attribute. Selecting an attribute on this basis involves a certain risk, which depends on the subsample size and the information gain values observed. This paper addresses the questions of assessing when the choice may be made with a given expected error, and determining a sampling strategy that minimizes the computation cost of making it. The theory is entirely analogous to that used for decision-theoretic control of search. An empirical evaluation shows that an implementation performs well over an appropriately wide range of inputs on two fundamental tasks: deciding whether a choice of attribute can be safely made on a given sample, and if not, estimating how large a subsample is likely to be required to do so.

Appeared in

ML 1993

Look at the Paper (ps.gz, pdf.gz)