Publications

Papers on machine learning and data mining
Data mining with cellular automata
Tom Fawcett
SigKDD Explorations, July 2008, Volume 10, Issue 1
Abstract: A cellular automaton is a discrete, dynamical system composed of very simple, uniformly interconnected cells. Cellular automata may be seen as an extreme form of simple, localized, distributed machines. Many researchers are familiar with cellular automata through Conway's Game of Life. Researchers have long been interested in the theoretical aspects of cellular automata. This article explores the use of cellular automata for data mining, specifically for classification tasks. We demonstrate that reasonable generalization behavior can be achieved as an emergent property of these simple automata.
PDF: DMCA-dist.pdf
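
As a rough illustration of the idea in the abstract above, the following Python sketch grows class regions on a small grid by local neighbour voting. It is not the code or the exact update rule from the article: the function names `step` and `run`, the four-neighbour vote, the non-wrapping grid edges, and the choice to keep labeled cells fixed are all assumptions made for this example.

```python
def step(grid):
    """One synchronous update. 0 is an empty cell, other integers are class labels."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0:
                continue  # assumption: labeled cells keep their class
            votes = {}
            # Count the classes of the four orthogonal neighbours (edges do not wrap).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] != 0:
                    label = grid[rr][cc]
                    votes[label] = votes.get(label, 0) + 1
            if votes:
                best = max(votes.values())
                winners = [k for k, v in votes.items() if v == best]
                if len(winners) == 1:  # ties leave the cell empty
                    new[r][c] = winners[0]
    return new


def run(grid, max_steps=100):
    """Iterate until the grid stops changing or max_steps is reached."""
    for _ in range(max_steps):
        nxt = step(grid)
        if nxt == grid:
            break
        grid = nxt
    return grid


# Two training points at opposite ends of a one-row grid.
print(run([[1, 0, 0, 0, 2]]))  # [[1, 1, 0, 2, 2]]
```

In this sketch, classifying a new instance would amount to looking up the class of the cell it falls into once the grid has settled.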
PRIE: A system for generating rulelists to maximize ROC performance
Tom Fawcett
Data Mining and Knowledge Discovery, Volume 17, Number 2, October 2008, pp. 207-224.
DOI: 10.1007/s10618-008-0089-y
Abstract:   Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy: when class distributions are skewed, or error costs are unequal, an accuracy maximizing classifier can perform poorly. This paper presents a method for learning rules directly from ROC space when the goal is to maximize the area under the ROC curve (AUC). Basic principles from rule learning and computational geometry are used to focus search for promising rule combinations. The result is a system that can learn intelligible rulelists with good ROC performance.
Journal page: 10.1007/s10618-008-0089-y
Draft copy (PDF): DMKD-UBDM-dist.pdf
PAV and the ROC Convex Hull
Tom Fawcett and Alexandru Niculescu-Mizil
Machine Learning, Volume 68, Issue 1, July 2007, pp. 97-106
Abstract: Classifier calibration is the process of converting classifier scores into reliable probability estimates. Recently, a calibration technique based on isotonic regression has gained attention within machine learning as a flexible and effective way to calibrate classifiers. We show that, surprisingly, isotonic regression based calibration using the Pool Adjacent Violators algorithm is equivalent to the ROC convex hull method.
PDF: PAV-ROCCH-dist.pdf
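
For readers unfamiliar with the Pool Adjacent Violators algorithm mentioned in the abstract above, here is a minimal Python sketch. It is not taken from the paper. The function name `pav` is hypothetical, and the sketch assumes binary 0/1 labels that are already sorted by increasing classifier score with no tied scores.

```python
def pav(labels):
    """Pool Adjacent Violators on 0/1 labels sorted by increasing score.

    Returns one non-decreasing probability estimate per instance.
    """
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        # Means are compared by cross-multiplication to avoid division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    estimates = []
    for s, n in blocks:
        estimates.extend([s / n] * n)
    return estimates


print(pav([0, 1, 0, 1, 1]))  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

Each pooled block corresponds to a run of instances that receive the same calibrated estimate.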
ROC Graphs with Instance Varying Costs
Tom Fawcett
Pattern Recognition Letters (27), No. 8, June 2006, pp. 882-891.
Abstract:   Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs have been used in cost-sensitive learning because of the ease with which class skew and error cost information can be applied to them to yield cost-sensitive decisions. However, they have been criticized because of their inability to handle instance-varying costs; that is, domains in which error costs vary from one instance to another. This paper presents and investigates a technique for adapting ROC graphs for use with domains in which misclassification costs vary within the instance population.

PDF: ROC-ESC-dist.pdf

A Response to Webb and Ting's "On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions"
Tom Fawcett and Peter Flach
Machine Learning, v.58, n.1, pp. 33-38, 2005
Abstract: In an article in this issue, Webb and Ting criticize ROC analysis for its inability to handle certain changes in class distributions. They imply that the ability of ROC graphs to depict performance in the face of changing class distributions has been overstated. In this editorial response, we describe two general types of domains and argue that Webb and Ting's concerns apply primarily to only one of them. Furthermore, we show that there are interesting real-world domains of the second type, in which ROC analysis may be expected to hold in the face of changing class distributions.

PDF: WT-rebuttal.pdf

Two articles in the Machine Learning special issue on Data Mining Lessons Learned (vol 57, no. 1-2)
  1. Editorial: Data Mining Lessons Learned
    Nada Lavrac, Hiroshi Motoda and Tom Fawcett
    PDF: dmll-editorial-ver5.pdf

  2. Introduction: Lessons Learned from Data Mining Applications and Collaborative Problem Solving
    Nada Lavrac, Hiroshi Motoda, Tom Fawcett, Rob Holte, Pat Langley and Pieter Adriaans
    PDF: dmll-intro-ver7.6.pdf

ROC Graphs: Notes and Practical Considerations for Researchers
Tom Fawcett
Previous version published as HP Labs Tech Report HPL-2003-4. This version has corrections, improvements and some new material.
Abstract: Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.

Postscript: ROC101.ps.gz
PDF: ROC101.pdf

Note: The code referenced in this article is available from this software directory. Unfortunately, the permanent URL (PURL) link given in the paper doesn't work.
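
Independently of the article's own code, the following Python sketch shows one common way to turn classifier scores into ROC points and to compute the area under the curve with the trapezoid rule. The function names `roc_points` and `auc` are hypothetical, and the sketch assumes binary 0/1 labels with at least one positive and one negative instance.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for a threshold sweep."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Visit instances from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so tied scores share one point.
        if score != prev_score:
            points.append((fp / neg, tp / pos))
            prev_score = score
        if label == 1:
            tp += 1
        else:
            fp += 1
    points.append((fp / neg, tp / pos))  # always ends at (1.0, 1.0)
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(pts)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(pts))  # 0.75
```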

"In vivo" spam filtering: A challenge problem for data mining
Tom Fawcett
SigKDD Explorations, vol. 5, no. 2, December 2003.
Abstract:  Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

PDF: spam-KDDexp.pdf

Using Rule Sets to Maximize ROC Performance
Tom Fawcett
Presented at the 2001 IEEE International Conference on Data Mining (ICDM-01)
Abstract:   Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy; when class distributions are skewed, or error costs are unequal, an accuracy maximizing rule set can perform poorly. A more flexible use of a rule set is to produce instance scores indicating the likelihood that an instance belongs to a given class. With such an ability, we can apply rulesets effectively when distributions are skewed or error costs are unequal. This paper empirically investigates different strategies for evaluating rule sets when the goal is to maximize the scoring (ROC) performance.

Postscript: ICDM-final.ps.gz
PDF: ICDM-final.pdf

Handbook of Data Mining and Knowledge Discovery
F2. Fraud Detection
H1.2.1 Case study: Adaptive Fraud Detection
These are chapters that appear in W. Kloesgen and J. Zytkow (eds.) Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 2001. Because of copyright issues, these chapters are not available for downloading. If you'd like to get a copy, send me email.

PLEASE NOTE: These articles are fairly dated now, and they were not intended as survey papers. If you're looking for a survey of machine learning or data mining techniques applied to fraud detection, I would recommend these papers:
  1. "A Comprehensive Survey of Data Mining-based Fraud Detection Research" by Phua, Lee and Gayler, available from this page.
  2. "Statistical Fraud Detection: A Review" by Bolton and Hand.

Robust Classification for Imprecise Environments
Foster Provost and Tom Fawcett
Machine Learning Journal, vol. 42, no. 3, March 2001, pp. 203-231
Abstract:  In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. We then show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization. In some cases, the performance of the hybrid can actually surpass that of the best known classifier. The hybrid is also efficient to build, to store, and to update. Finally, we point to empirical evidence that a robust hybrid classifier is needed for many real-world problems.

Postscript: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.ps (512K)
PDF: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.pdf (413K)
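
The geometric core of the ROC convex hull idea described in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from the paper: the function names `cross` and `roc_convex_hull` are hypothetical, each classifier is represented as a single (false positive rate, true positive rate) point, and the trivial classifiers (0, 0) and (1, 1) are always included.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of ROC points, from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # joining its predecessor to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


classifiers = [(0.1, 0.5), (0.3, 0.6), (0.4, 0.9), (0.7, 0.8)]
print(roc_convex_hull(classifiers))
# [(0.0, 0.0), (0.1, 0.5), (0.4, 0.9), (1.0, 1.0)]
```

In this example the classifiers at (0.3, 0.6) and (0.7, 0.8) fall below the hull, so they are not potentially optimal under any class and cost distribution.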

Activity Monitoring: Noticing interesting changes in behavior
Tom Fawcett and Foster Provost
Presented at KDD-99 (Fifth International Conference on Knowledge Discovery and Data Mining)
Abstract:   We introduce a problem class which we term activity monitoring. Such problems involve monitoring the behavior of a large population of entities for interesting events requiring action. We present a framework within which each of the individual problems has a natural expression, as well as a methodology for evaluating performance of activity monitoring techniques. We show that two superficially different tasks, news story monitoring and intrusion detection, can be expressed naturally within the framework, and show that key differences in solution methods can be compared.

Postscript: KDD99.ps.gz (125K)

Here is an online version of the talk on this work that I gave at a Stanford Symposium on Anomaly Detection.

The Case Against Accuracy Estimation for Comparing Induction Algorithms
Foster Provost, Tom Fawcett and Ron Kohavi
Presented at ICML-98 (Fifteenth International Conference on Machine Learning), July 1998
Abstract:  We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and draw into question the conclusions that can be drawn from such studies. In the course of the presentation, we describe and demonstrate what we believe to be the proper use of ROC analysis for comparative studies in machine learning research. We argue that this methodology is preferable both for making practical choices and for drawing scientific conclusions.

Postscript: ICML98-final.ps.gz

Robust Classification Systems for Imprecise Environments
Foster Provost and Tom Fawcett
Presented at AAAI-98 (Fifteenth National Conference on Artificial Intelligence), July 1998
Abstract:  In real-world environments, it is usually difficult to specify target operating conditions precisely.  This uncertainty makes building robust classification systems problematic.  We show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions.  This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization.  In some cases, the performance of the hybrid can actually surpass that of the best known classifier.  The hybrid is also efficient to build, to store, and to update.  Finally, we provide empirical evidence that a robust hybrid classifier is needed for many real-world problems.

Postscript: aaai98-dist.ps.gz (81K)

Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions
Foster Provost and Tom Fawcett
Presented at KDD-97 (Third International Conference on Knowledge Discovery and Data Mining)
Winner of Best Paper Award (Best Fundamental Research)
Abstract:  When mining data with inductive methods, we often experiment with a wide variety of learning algorithms, using different algorithm parameters, varying output threshold values, and using different training regimens. Such experimentation yields a large number of classifiers to be evaluated and compared. In order to compare the performance of classifiers it is necessary to know the conditions under which they will be used; using accuracy alone is inadequate because class distributions and misclassification costs are rarely uniform.

Decision-theoretic principles may be used if the class and cost distributions are known exactly. Unfortunately, on real-world problems target cost and class distributions can rarely be specified precisely, and they are often subject to change. For example, in fraud detection we cannot ignore either type of distribution, nor can we assume that our distribution specifications are static or precise. We need a method for the management and comparison of multiple classifiers that is robust to imprecise and changing environments.

We introduce the ROC convex hull method, which combines techniques from ROC analysis, decision analysis and computational geometry. The method decouples classifier performance from specific class and cost distributions, and may be used to specify the subset of methods that are potentially optimal under any cost and class distribution assumptions.

Postscript: KDD-97.ps.gz
PDF: KDD-97.pdf

Adaptive Fraud Detection
Tom Fawcett and Foster Provost
Published in Journal of Data Mining and Knowledge Discovery, v.1 n.3, 1997
Abstract:  One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions. Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies. Finally, the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms. The system has been applied to the problem of detecting cellular cloning fraud based on a database of call records. Experiments indicate that this automatic approach outperforms hand-crafted methods for detecting fraud. Furthermore, this approach can adapt to the changing conditions typical of fraud detection environments.

Postscript: DMKD-97.ps.gz

Essay: On the Value of Applied Research in Machine Learning
Foster Provost, Tom Fawcett, Andrea Danyluk and Patricia Riddle
Included in the Machine Learning List, V8, #7 (4/22/96)

Combining Data Mining and Machine Learning for Effective User Profiling
Tom Fawcett and Foster Provost
Presented at KDD-96 (Second International Conference on Knowledge Discovery and Data Mining)
Extended Abstract:  In the United States, cellular fraud costs the telecommunications industry hundreds of millions of dollars per year. A specific kind of cellular fraud called cloning is particularly expensive and epidemic in major cities throughout the United States. Existing methods for detecting cloning fraud are ad hoc and their evaluation is virtually nonexistent. We have embarked on a program of systematic analysis of cellular call data for the purpose of designing and evaluating methods for detecting fraudulent behavior.

This paper presents a framework for automatically generating fraud detectors. The framework has several components, and uses data at two levels of aggregation. Massive numbers of cellular calls are first analyzed to determine general patterns of fraudulent usage. These patterns are then used to profile each individual customer's usage on an account-day basis. The profiles determine when a customer's behavior has become uncharacteristic in a way that suggests fraud.

Our framework includes a data mining component for discovering indicators of fraud. A constructive induction component generates profiling detectors that use the discovered indicators. A final evidence-combining component determines how to combine signals from the profiling detectors to generate alarms. The rest of this paper describes the domain, the framework and the implemented system, the data, and results.

Postscript: UserProfiling.ps.gz

Knowledge-based Feature Discovery for Evaluation Functions
Tom Fawcett
Computational Intelligence 12(1), February 1996.
Abstract: Since Samuel's work on checkers over thirty years ago, much effort has been devoted to learning evaluation functions. However, all such methods are sensitive to the feature set chosen to represent the examples. If the features do not capture aspects of the examples significant for problem solving, the learned evaluation function may be inaccurate or inconsistent. Typically, good feature sets are carefully handcrafted and a great deal of time and effort goes into refining and tuning them. This paper presents an automatic knowledge-based method for generating features for evaluation functions. The feature set is developed iteratively: features are generated, then evaluated, and this information is used to develop new features in turn. Both the contribution of a feature and its computational expense are considered in determining whether and how to develop it further. This method has been applied to two problem solving domains: the Othello board game and the domain of telecommunications network management. Empirical results show that the method is able to generate many known features and several novel features, and to improve concept accuracy in both domains.
PDF: kbfd.pdf

Modified 20-May-2008