Publications

Papers on machine learning and data mining
BOOK:
Data Science for Business
Fundamental principles of data mining and data-analytic thinking

Foster Provost and Tom Fawcett
O'Reilly Media, Inc.
Expected July 2013.


Data Science for Business is a new book by Foster Provost and Tom Fawcett intended for those who need to understand data science/data mining, and those who want to develop their skill at data-analytic thinking.  Data Science for Business is not a book about algorithms.  Instead it presents a set of fundamental principles for extracting useful knowledge from data.  These fundamental principles are the foundation for many algorithms and techniques for data mining, but also underlie the processes and methods for approaching business problems data-analytically, evaluating particular data science solutions, and evaluating general data science plans.

Book webpage: www.data-science-for-biz.com
Data Science and its Relationship to Big Data and Data-driven Decision Making

Foster Provost and Tom Fawcett
Big Data Journal, Vol. 1, No. 1, March 2013


Abstract: Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot—even "sexy"—career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what is data science. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner’s field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.

Official Journal Link
PDF:
BigDataProvostFawcett.pdf
Swarm Intelligence for Data Mining

David Martens, Bart Baesens and Tom Fawcett
Machine Learning, Vol. 82, No. 1, January 2011
From the issue entitled "Special Issue on Swarm Intelligence; Guest Editors: David Martens, Bart Baesens, and Tom Fawcett"

Abstract: This paper surveys the intersection of two fascinating and increasingly popular domains: swarm intelligence and data mining. Whereas data mining has been a popular academic topic for decades, swarm intelligence is a relatively new subfield of artificial intelligence which studies the emergent collective intelligence of groups of simple agents. It is based on social behavior that can be observed in nature, such as ant colonies, flocks of birds, fish schools and bee hives, where a number of individuals with limited capabilities are able to come to intelligent solutions for complex problems. In recent years the swarm intelligence paradigm has received widespread attention in research, mainly as Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO). These are also the most popular swarm intelligence metaheuristics for data mining. In addition to an overview of these nature inspired computing methodologies, we discuss popular data mining techniques based on these principles and schematically list the main differences in our literature tables. Further, we provide a unifying framework that categorizes the swarm intelligence based data mining algorithms into two approaches: effective search and data organizing. Finally, we list interesting issues for future research, hereby identifying methodological gaps in current research as well as mapping opportunities provided by swarm intelligence to current challenges within data mining research.

Official Springer Link
PDF:
fulltext.pdf
Data mining with cellular automata

Tom Fawcett
SIGKDD Explorations, Vol. 10, No. 1, July 2008

Abstract: A cellular automaton is a discrete, dynamical system composed of very simple, uniformly interconnected cells. Cellular automata may be seen as an extreme form of simple, localized, distributed machines. Many researchers are familiar with cellular automata through Conway's Game of Life. Researchers have long been interested in the theoretical aspects of cellular automata. This article explores the use of cellular automata for data mining, specifically for classification tasks. We demonstrate that reasonable generalization behavior can be achieved as an emergent property of these simple automata.
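The flavor of the approach, classification emerging from purely local cell updates, can be conveyed with a toy label-spreading automaton. This is an illustrative rule of my own, not the update rules studied in the article:

```python
from collections import Counter

def ca_label_spread(grid, steps):
    """Synchronously update a 2D grid in which 0 marks an unlabeled cell
    and any other integer is a class label: each unlabeled cell takes the
    majority label among its labeled 4-neighbors."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(steps):
        new = [row[:] for row in grid]
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] != 0:
                    continue  # already labeled; this toy rule never relabels
                votes = Counter()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] != 0:
                        votes[grid[rr][cc]] += 1
                if votes:
                    new[r][c] = votes.most_common(1)[0][0]
        grid = new
    return grid
```

Seeding the grid with labeled training points and iterating carves the plane into class regions as an emergent property, roughly in the nearest-neighbor spirit the abstract describes.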

PDF: DMCA-dist.pdf
PRIE: A system for generating rulelists to maximize ROC performance

Tom Fawcett
Data Mining and Knowledge Discovery, Volume 17, Number 2, October 2008, pp. 207-224.
DOI: 10.1007/s10618-008-0089-y

Abstract:   Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy: when class distributions are skewed, or error costs are unequal, an accuracy maximizing classifier can perform poorly. This paper presents a method for learning rules directly from ROC space when the goal is to maximize the area under the ROC curve (AUC). Basic principles from rule learning and computational geometry are used to focus search for promising rule combinations. The result is a system that can learn intelligible rulelists with good ROC performance.
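The AUC criterion that PRIE optimizes can be computed from a classifier's ROC points with the trapezoidal rule; a minimal sketch (the function is illustrative, not part of PRIE):

```python
def auc(points):
    """Area under an ROC curve by the trapezoidal rule, given its
    (FPR, TPR) points in order of increasing FPR."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

For example, the diagonal from (0, 0) to (1, 1) (random guessing) yields 0.5, and a curve through (0, 1) (perfect ranking) yields 1.0.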

Journal page: 10.1007/s10618-008-0089-y
Draft copy (PDF):
DMKD-UBDM-dist.pdf
PAV and the ROC Convex Hull

Tom Fawcett and Alexandru Niculescu-Mizil
Machine Learning, Volume 68, Issue 1, July 2007, pp. 97-106

Abstract: Classifier calibration is the process of converting classifier scores into reliable probability estimates. Recently, a calibration technique based on isotonic regression has gained attention within machine learning as a flexible and effective way to calibrate classifiers. We show that, surprisingly, isotonic regression based calibration using the Pool Adjacent Violators algorithm is equivalent to the ROC convex hull method.
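The Pool Adjacent Violators algorithm at the heart of the equivalence fits in a few lines; a minimal sketch (illustrative, not the authors' code), assuming binary 0/1 labels paired with classifier scores:

```python
def pav(scores_labels):
    """Pool Adjacent Violators: given (score, label) pairs, return one
    calibrated probability per example, non-decreasing in score order."""
    # Each block holds [sum_of_labels, count]; merge while an earlier
    # block's mean exceeds a later one's (an "adjacent violation").
    blocks = []
    for _, label in sorted(scores_labels):
        blocks.append([float(label), 1])
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    # Expand blocks back into one calibrated value per example.
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return out
```

Each merged block of tied probabilities corresponds to a linear segment of the ROC convex hull, which is the equivalence the paper establishes.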

PDF: PAV-ROCCH-dist.pdf
ROC Graphs with Instance Varying Costs

Tom Fawcett
Pattern Recognition Letters, Vol. 27, No. 8, June 2006, pp. 882-891.

Abstract:   Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs have been used in cost-sensitive learning because of the ease with which class skew and error cost information can be applied to them to yield cost-sensitive decisions. However, they have been criticized because of their inability to handle instance-varying costs; that is, domains in which error costs vary from one instance to another. This paper presents and investigates a technique for adapting ROC graphs for use with domains in which misclassification costs vary within the instance population.

PDF: ROC-ESC-dist.pdf

A Response to Webb and Ting's On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions

Tom Fawcett and Peter Flach
Machine Learning, Vol. 58, No. 1, pp. 33-38, 2005

Abstract: In an article in this issue, Webb and Ting criticize ROC analysis for its inability to handle certain changes in class distributions. They imply that the ability of ROC graphs to depict performance in the face of changing class distributions has been overstated. In this editorial response, we describe two general types of domains and argue that Webb and Ting's concerns apply primarily to only one of them. Furthermore, we show that there are interesting real-world domains of the second type, in which ROC analysis may be expected to hold in the face of changing class distributions.

PDF: WT-rebuttal.pdf

Two articles in the Machine Learning special issue on Data Mining Lessons Learned (vol 57, no. 1-2)

  1. Editorial: Data Mining Lessons Learned
    Nada Lavrac, Hiroshi Motoda and Tom Fawcett
    PDF: dmll-editorial-ver5.pdf

  2. Introduction: Lessons Learned from Data Mining Applications and Collaborative Problem Solving
    Nada Lavrac, Hiroshi Motoda, Tom Fawcett, Rob Holte, Pat Langley and Pieter Adriaans
    PDF: dmll-intro-ver7.6.pdf

ROC Graphs: Notes and Practical Considerations for Researchers

Tom Fawcett
Previous version published as HP Labs Tech Report HPL-2003-4. This version has corrections, improvements and some new material.

Abstract: Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.
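The basic curve construction the tutorial covers, sweeping a threshold over sorted scores and emitting an (FPR, TPR) point each time the score changes, can be sketched as follows. This is an illustrative implementation, not the code distributed with the article, and assumes binary labels (1 = positive, 0 = negative):

```python
def roc_points(scores, labels):
    """Trace an ROC curve: sort instances by decreasing score and sweep a
    threshold, emitting an (FPR, TPR) point whenever the score changes.
    Emitting points only on score changes handles ties correctly."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    tp = fp = 0
    prev_score = None
    for score, label in sorted(zip(scores, labels), reverse=True):
        if score != prev_score:
            points.append((fp / neg, tp / pos))
            prev_score = score
        if label == 1:
            tp += 1
        else:
            fp += 1
    points.append((1.0, 1.0))
    return points
```

A perfectly ranked set of scores produces the staircase through (0, 1); interleaved scores push the curve toward the diagonal.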

Postscript: ROC101.ps.gz
PDF: ROC101.pdf

Note: The code referenced in this article is available from this software directory. Unfortunately, the permanent URL (PURL) link given in the paper doesn't work.

"In vivo" spam filtering: A challenge problem for data mining


Tom Fawcett
SIGKDD Explorations, Vol. 5, No. 2, December 2003.

Abstract:  Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

PDF: spam-KDDexp.pdf

Using Rule Sets to Maximize ROC Performance

Tom Fawcett
Presented at the 2001 IEEE International Conference on Data Mining (ICDM-01)

Abstract:   Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy; when class distributions are skewed, or error costs are unequal, an accuracy maximizing rule set can perform poorly. A more flexible use of a rule set is to produce instance scores indicating the likelihood that an instance belongs to a given class. With such an ability, we can apply rulesets effectively when distributions are skewed or error costs are unequal. This paper empirically investigates different strategies for evaluating rule sets when the goal is to maximize the scoring (ROC) performance.

Postscript: ICDM-final.ps.gz
PDF: ICDM-final.pdf

Handbook of Data Mining and Knowledge Discovery
F2. Fraud Detection
H1.2.1 Case study: Adaptive Fraud Detection

These are chapters that appear in W. Kloesgen and J. Zytkow (eds.) Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 2001. Because of copyright issues, these chapters are not available for downloading. If you'd like to get a copy, send me email.

PLEASE NOTE: These articles are fairly dated now, and they were not intended as survey papers. If you're looking for a survey of machine learning or data mining techniques applied to fraud detection, I would recommend these papers:
  1. "A Comprehensive Survey of Data Mining-based Fraud Detection Research" by Phua, Lee and Gayler, available from this page.
  2. "Statistical Fraud Detection: A Review" by Bolton and Hand.
Robust Classification for Imprecise Environments

Foster Provost and Tom Fawcett
Machine Learning, Vol. 42, No. 3, March 2001, pp. 203-231

Abstract:  In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. We then show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization. In some cases, the performance of the hybrid can actually surpass that of the best known classifier. The hybrid is also efficient to build, to store, and to update. Finally, we point to empirical evidence that a robust hybrid classifier is needed for many real-world problems.
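The geometric core of the method is keeping only the classifiers on the upper convex hull of ROC space. A minimal sketch using a monotone-chain scan (illustrative only; the full method also uses iso-performance lines to select a hull classifier for given operating conditions):

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    """Upper convex hull of a set of (FPR, TPR) points, always including
    the trivial classifiers (0,0) and (1,1). Only classifiers on this
    hull can be optimal for some class and cost distribution."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop points that make a non-right turn: they lie on or below
        # the new segment, so they cannot be on the upper hull.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```

Classifiers dominated by the hull, such as a point below the segment joining two hull classifiers, are discarded; the hybrid classifier described in the abstract interpolates along the remaining segments.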

Postscript: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.ps (512K)
PDF: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.pdf (413K)

Activity Monitoring: Noticing interesting changes in behavior

Tom Fawcett and Foster Provost
Presented at KDD-99 (Fifth International Conference on Knowledge Discovery and Data Mining)

Abstract:   We introduce a problem class which we term activity monitoring. Such problems involve monitoring the behavior of a large population of entities for interesting events requiring action. We present a framework within which each of the individual problems has a natural expression, as well as a methodology for evaluating performance of activity monitoring techniques. We show that two superficially different tasks, news story monitoring and intrusion detection, can be expressed naturally within the framework, and show that key differences in solution methods can be compared.

Postscript: KDD99.ps.gz (125K)

Here is an online presentation of this work that I presented at a Stanford Symposium on Anomaly Detection.

The Case Against Accuracy Estimation for Comparing Induction Algorithms

Foster Provost, Tom Fawcett and Ron Kohavi
Presented at ICML-98 (Fifteenth International Conference on Machine Learning), July 1998

Abstract:  We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and draw into question the conclusions that can be drawn from such studies. In the course of the presentation, we describe and demonstrate what we believe to be the proper use of ROC analysis for comparative studies in machine learning research. We argue that this methodology is preferable both for making practical choices and for drawing scientific conclusions.

Postscript: ICML98-final.ps.gz

Robust Classification Systems for Imprecise Environments

Foster Provost and Tom Fawcett
Presented at AAAI-98 (Fifteenth National Conference on Artificial Intelligence), July 1998

Abstract:  In real-world environments, it is usually difficult to specify target operating conditions precisely.  This uncertainty makes building robust classification systems problematic.  We show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions.  This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization.  In some cases, the performance of the hybrid can actually surpass that of the best known classifier.  The hybrid is also efficient to build, to store, and to update.  Finally, we provide empirical evidence that a robust hybrid classifier is needed for many real-world problems.

Postscript: aaai98-dist.ps.gz (81K)

Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions

Foster Provost and Tom Fawcett
Presented at KDD-97 (Third International Conference on Knowledge Discovery and Data Mining)
Winner of Best Paper Award (Best Fundamental Research)

Abstract:  When mining data with inductive methods, we often experiment with a wide variety of learning algorithms, using different algorithm parameters, varying output threshold values, and using different training regimens. Such experimentation yields a large number of classifiers to be evaluated and compared. In order to compare the performance of classifiers it is necessary to know the conditions under which they will be used; using accuracy alone is inadequate because class distributions and misclassification costs are rarely uniform.

Decision-theoretic principles may be used if the class and cost distributions are known exactly. Unfortunately, on real-world problems target cost and class distributions can rarely be specified precisely, and they are often subject to change. For example, in fraud detection we cannot ignore either type of distribution, nor can we assume that our distribution specifications are static or precise. We need a method for the management and comparison of multiple classifiers that is robust to imprecise and changing environments.

We introduce the ROC convex hull method, which combines techniques from ROC analysis, decision analysis and computational geometry. The method decouples classifier performance from specific class and cost distributions, and may be used to specify the subset of methods that are potentially optimal under any cost and class distribution assumptions.


Postscript:  KDD-97.ps.gz PDF: KDD-97.pdf

Adaptive Fraud Detection

Tom Fawcett and Foster Provost
Published in Data Mining and Knowledge Discovery, Vol. 1, No. 3, 1997

Abstract:  One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions. Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies. Finally, the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms. The system has been applied to the problem of detecting cellular cloning fraud based on a database of call records. Experiments indicate that this automatic approach outperforms hand-crafted methods for detecting fraud. Furthermore, this approach can adapt to the changing conditions typical of fraud detection environments.

Postscript: DMKD-97.ps.gz

Essay: On the Value of Applied Research in Machine Learning

Foster Provost, Tom Fawcett, Andrea Danyluk and Patricia Riddle
Included in the Machine Learning List, Vol. 8, No. 7 (4/22/96)

Combining Data Mining and Machine Learning for Effective User Profiling

Tom Fawcett and Foster Provost
Presented at KDD-96 (Second International Conference on Knowledge Discovery and Data Mining)

Extended Abstract:  In the United States, cellular fraud costs the telecommunications industry hundreds of millions of dollars per year. A specific kind of cellular fraud called cloning is particularly expensive and epidemic in major cities throughout the United States. Existing methods for detecting cloning fraud are ad hoc and their evaluation is virtually nonexistent. We have embarked on a program of systematic analysis of cellular call data for the purpose of designing and evaluating methods for detecting fraudulent behavior.

This paper presents a framework for automatically generating fraud detectors. The framework has several components, and uses data at two levels of aggregation. Massive numbers of cellular calls are first analyzed to determine general patterns of fraudulent usage. These patterns are then used to profile each individual customer's usage on an account-day basis. The profiles determine when a customer's behavior has become uncharacteristic in a way that suggests fraud.

Our framework includes a data mining component for discovering indicators of fraud. A constructive induction component generates profiling detectors that use the discovered indicators. A final evidence-combining component determines how to combine signals from the profiling detectors to generate alarms. The rest of this paper describes the domain, the framework and the implemented system, the data, and results.


Postscript: UserProfiling.ps.gz

Knowledge-based Feature Discovery for Evaluation Functions

Tom Fawcett
Computational Intelligence 12(1), February 1996.

Abstract: Since Samuel's work on checkers over thirty years ago, much effort has been devoted to learning evaluation functions. However, all such methods are sensitive to the feature set chosen to represent the examples. If the features do not capture aspects of the examples significant for problem solving, the learned evaluation function may be inaccurate or inconsistent. Typically, good feature sets are carefully handcrafted and a great deal of time and effort goes into refining and tuning them. This paper presents an automatic knowledge-based method for generating features for evaluation functions. The feature set is developed iteratively: features are generated, then evaluated, and this information is used to develop new features in turn. Both the contribution of a feature and its computational expense are considered in determining whether and how to develop it further. This method has been applied to two problem solving domains: the Othello board game and the domain of telecommunications network management. Empirical results show that the method is able to generate many known features and several novel features, and to improve concept accuracy in both domains.

PDF: kbfd.pdf

Modified 29-Dec-2010