Publications
Papers on machine learning and data
mining |
| Data mining
with cellular automata |
|
Tom Fawcett
SigKDD Explorations, July 2008, Volume 10, Issue 1 |
|
Abstract: A
cellular automaton is a discrete, dynamical system composed of
very simple, uniformly interconnected cells. Cellular automata
may be seen as an extreme form of simple, localized,
distributed machines. Many researchers are familiar with
cellular automata through Conway's Game of Life. Researchers
have long been interested in the theoretical aspects of
cellular automata. This article explores the use of cellular
automata for data mining, specifically for classification
tasks. We demonstrate that reasonable generalization behavior
can be achieved as an emergent property of these simple
automata. |
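The idea of classification emerging from local cell updates can be illustrated with a minimal sketch. It assumes a 2D grid in which training points are placed as labeled cells (1 or 2) and every unlabeled cell (0) repeatedly adopts the majority class among its four von Neumann neighbors. The function name `ca_classify`, the grid encoding, and this particular voting rule are assumptions made for illustration and are not necessarily the rule studied in the paper.

```python
from typing import List

def ca_classify(grid: List[List[int]], max_steps: int = 100) -> List[List[int]]:
    """Fill unlabeled cells (0) by synchronous majority vote of 4-neighbors.

    Cells already labeled 1 or 2 stay fixed in this sketch (an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for _ in range(max_steps):
        nxt = [row[:] for row in current]
        changed = False
        for r in range(rows):
            for c in range(cols):
                if current[r][c] != 0:
                    continue
                votes = {1: 0, 2: 0}
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and current[rr][cc] != 0:
                        votes[current[rr][cc]] += 1
                # A tie (including no labeled neighbors) leaves the cell unlabeled.
                if votes[1] > votes[2]:
                    nxt[r][c] = 1
                    changed = True
                elif votes[2] > votes[1]:
                    nxt[r][c] = 2
                    changed = True
        current = nxt
        if not changed:
            break
    return current

if __name__ == "__main__":
    seed = [[1, 0, 0, 0, 2]]
    print(ca_classify(seed))
```

In this sketch the class regions grow outward from the training points until they meet, so the decision boundary is a by-product of purely local updates. Cells that are exactly tied stay unlabeled; a practical rule would need some way to break such ties.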
|
PDF: DMCA-dist.pdf |
| PRIE: A system for generating rulelists
to maximize ROC performance |
|
Tom Fawcett
Data Mining and Knowledge Discovery,
Volume 17, Number 2, October 2008, pp. 207-224. DOI: 10.1007/s10618-008-0089-y |
|
Abstract:
Rules are commonly used for classification because they are
modular, intelligible and easy to learn. Existing work in
classification rule learning assumes the goal is to produce
categorical classifications to maximize classification
accuracy. Recent work in machine learning has pointed out the
limitations of classification accuracy: when class
distributions are skewed, or error costs are unequal, an
accuracy maximizing classifier can perform poorly. This paper
presents a method for learning rules directly from ROC space
when the goal is to maximize the area under the ROC curve
(AUC). Basic principles from rule learning and computational
geometry are used to focus search for promising rule
combinations. The result is a system that can learn
intelligible rulelists with good ROC
performance. |
|
Journal page: 10.1007/s10618-008-0089-y
Draft copy (PDF): DMKD-UBDM-dist.pdf |
| PAV and the ROC Convex
Hull |
|
Tom Fawcett and Alexandru Niculescu-Mizil
Machine Learning, Volume 68, Issue 1, July 2007, pp. 97-106 |
|
Abstract:
Classifier calibration is the process of converting
classifier scores into reliable probability estimates.
Recently, a calibration technique based on isotonic regression
has gained attention within machine learning as a flexible and
effective way to calibrate classifiers. We show that,
surprisingly, isotonic regression based calibration using the
Pool Adjacent Violators algorithm is equivalent to the ROC
convex hull method. |
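For readers unfamiliar with the Pool Adjacent Violators (PAV) algorithm named in the abstract, the following is a minimal sketch of PAV-based calibration. It is not the authors' code: the function name `pav_calibrate` is hypothetical, labels are assumed to be 0 or 1, and scores are assumed to be distinct (tied scores would need to be pooled first).

```python
from typing import List, Sequence, Tuple

def pav_calibrate(scores: Sequence[float],
                  labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (score, calibrated probability) pairs in ascending score order.

    Assumes binary labels (0/1) and distinct scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block is [sum of labels, count]; its mean is the probability estimate.
    blocks: List[List[float]] = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Pool adjacent blocks while they violate the non-decreasing constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    probs: List[float] = []
    for total, count in blocks:
        probs.extend([total / count] * int(count))
    return [(scores[i], p) for i, p in zip(order, probs)]

if __name__ == "__main__":
    for score, prob in pav_calibrate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                                     [0, 1, 0, 1, 1, 0]):
        print(score, round(prob, 3))
```

Each pooled block is a run of instances that receive the same calibrated probability. The equivalence stated in the abstract relates these blocks to the segments of the ROC convex hull.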
|
PDF: PAV-ROCCH-dist.pdf |
| ROC Graphs with
Instance Varying Costs |
|
Tom Fawcett
Pattern Recognition Letters, vol. 27, no. 8, June 2006, pp. 882-891. |
|
Abstract: Receiver Operating
Characteristics (ROC) graphs are a useful technique for
organizing classifiers and visualizing their performance. ROC
graphs have been used in cost-sensitive learning because of
the ease with which class skew and error cost information can
be applied to them to yield cost-sensitive decisions. However,
they have been criticized because of their inability to handle
instance-varying costs; that is, domains in which error costs
vary from one instance to another. This paper presents and
investigates a technique for adapting ROC graphs for use with
domains in which misclassification costs vary within the
instance population. |
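One way to picture instance-varying costs is to let each instance contribute its own cost, rather than a count of one, to the true positive and false positive rates. The sketch below assumes this reading; the function name `cost_weighted_rates` and its parameters are hypothetical, and the construction in the paper may differ in its details.

```python
from typing import Sequence, Tuple

def cost_weighted_rates(scores: Sequence[float],
                        labels: Sequence[int],
                        costs: Sequence[float],
                        threshold: float) -> Tuple[float, float]:
    """Return (cost-weighted FP rate, cost-weighted TP rate) at a threshold.

    An instance is predicted positive when its score is >= threshold.
    Assumes both classes are present and all costs are positive.
    """
    pos_total = sum(c for y, c in zip(labels, costs) if y == 1)
    neg_total = sum(c for y, c in zip(labels, costs) if y == 0)
    tp_cost = sum(c for s, y, c in zip(scores, labels, costs)
                  if y == 1 and s >= threshold)
    fp_cost = sum(c for s, y, c in zip(scores, labels, costs)
                  if y == 0 and s >= threshold)
    return fp_cost / neg_total, tp_cost / pos_total

if __name__ == "__main__":
    print(cost_weighted_rates([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0],
                              [10.0, 1.0, 1.0, 3.0], threshold=0.5))
```

In the usage example the unweighted rates would both be 0.5. Once costs are taken into account, the single expensive positive that is caught dominates the true positive rate.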
|
PDF: ROC-ESC-dist.pdf
|
| A Response to Webb
and Ting's On
the Application of ROC Analysis to Predict Classification
Performance Under Varying Class
Distributions |
|
Tom Fawcett and Peter Flach
Machine Learning, v. 58, n. 1, pp. 33-38, 2005 |
|
Abstract: In an article in this issue, Webb
and Ting criticize ROC analysis for its inability to handle
certain changes in class distributions. They imply that the
ability of ROC graphs to depict performance in the face of
changing class distributions has been overstated. In this
editorial response, we describe two general types of domains
and argue that Webb and Ting's concerns apply primarily to
only one of them. Furthermore, we show that there are
interesting real-world domains of the second type, in which
ROC analysis may be expected to hold in the face of changing
class distributions. |
|
PDF: WT-rebuttal.pdf
|
| Two articles in the
Machine Learning special issue on Data Mining Lessons
Learned (vol 57, no. 1-2) |
|
- Editorial: Data Mining Lessons Learned
  Nada Lavrac, Hiroshi Motoda and Tom Fawcett
  PDF: dmll-editorial-ver5.pdf
- Introduction: Lessons Learned from Data Mining Applications and
  Collaborative Problem Solving
  Nada Lavrac, Hiroshi Motoda, Tom Fawcett, Rob Holte, Pat
  Langley and Pieter Adriaans
  PDF: dmll-intro-ver7.6.pdf
|
| ROC Graphs: Notes
and Practical Considerations for
Researchers |
|
Tom Fawcett
Previous version published as HP Labs Tech Report HPL-2003-4.
This version has corrections, improvements and some new
material. |
|
Abstract: Receiver Operating Characteristics
(ROC) graphs are a useful technique for organizing classifiers
and visualizing their performance. ROC graphs are commonly
used in medical decision making, and in recent years have been
increasingly adopted in the machine learning and data mining
research communities. Although ROC graphs are apparently
simple, there are some common misconceptions and pitfalls when
using them in practice. This article serves both as a tutorial
introduction to ROC graphs and as a practical guide for using
them in research. |
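As a small illustration of what an ROC graph is built from, the sketch below generates ROC points from classifier scores and computes the area under the curve by the trapezoid rule. It is an independent minimal sketch, not the code referenced in the article; the function names `roc_points` and `auc` are assumptions.

```python
from typing import List, Sequence, Tuple

def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FP rate, TP rate) points, sweeping the threshold from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so instances with tied
        # scores yield one sloped segment instead of an arbitrary staircase.
        if prev_score is not None and score != prev_score:
            points.append((fp / neg, tp / pos))
        if label == 1:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / neg, tp / pos))
    return points

def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
    print(pts)
    print(auc(pts))
```

Handling tied scores together is one of the practical details that is easy to get wrong: processing tied instances one at a time makes the curve depend on their arbitrary ordering.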
|
Postscript: ROC101.ps.gz
PDF: ROC101.pdf
Note: The code
referenced in this article is available from
this software directory. Unfortunately, the
permanent URL (PURL) link given in the paper doesn't
work.
|
|
"In vivo" spam filtering:
A challenge problem for data
mining
|
|
Tom Fawcett
SigKDD Explorations, vol. 5, no. 2, December 2003.
|
|
Abstract: Spam, also known as
Unsolicited Commercial Email (UCE), is the bane of email
communication. Many data mining researchers have addressed the
problem of detecting spam, generally by treating it as a
static text classification problem. True in vivo spam
filtering has characteristics that make it a rich and
challenging domain for data mining. Indeed, real-world
datasets with these characteristics are typically difficult to
acquire and to share. This paper demonstrates some of these
characteristics and argues that researchers should pursue
in vivo spam filtering as an accessible domain for
investigating them. |
|
PDF: spam-KDDexp.pdf
|
| Using Rule Sets to
Maximize ROC Performance |
|
Tom Fawcett
Presented at the 2001 IEEE International Conference on Data
Mining (ICDM-01) |
|
Abstract: Rules
are commonly used for classification because they are modular,
intelligible and easy to learn. Existing work in
classification rule learning assumes the goal is to produce
categorical classifications to maximize classification
accuracy. Recent work in machine learning has pointed out the
limitations of classification accuracy; when class
distributions are skewed, or error costs are unequal, an
accuracy maximizing rule set can perform poorly. A more
flexible use of a rule set is to produce instance scores
indicating the likelihood that an instance belongs to a given
class. With such an ability, we can apply rulesets effectively
when distributions are skewed or error costs are unequal. This
paper empirically investigates different strategies for
evaluating rule sets when the goal is to maximize the scoring
(ROC) performance. |
|
Postscript: ICDM-final.ps.gz
PDF: ICDM-final.pdf
|
Handbook of Data
Mining and Knowledge Discovery
F2. Fraud Detection
H1.2.1 Case study: Adaptive Fraud
Detection |
|
These are chapters that
appear in W. Kloesgen and J. Zytkow (eds.) Handbook of
Data Mining and Knowledge Discovery, Oxford University
Press, 2001. Because of copyright issues, these chapters are
not available for downloading. If you'd like to get a copy,
send me email.
PLEASE NOTE: These
articles are fairly dated now, and they were not
intended as survey papers. If you're looking for a survey of
machine learning or data mining techniques applied to fraud
detection, I would recommend these papers:
- "A Comprehensive Survey of Data Mining-based Fraud Detection
  Research" by Phua, Lee and Gayler, available from this page.
- "Statistical Fraud Detection: A Review" by Bolton and Hand.
|
| Robust
Classification for Imprecise
Environments |
|
Foster Provost and Tom Fawcett
Machine Learning Journal, vol. 42, no. 3, March 2001,
pp. 203-231 |
|
Abstract: In real-world
environments it is usually difficult to specify target
operating conditions precisely. This uncertainty makes
building robust classification systems problematic. We present
a method for the comparison of classifier performance that is
robust to imprecise class distributions and misclassification
costs. The ROC convex hull method combines techniques from ROC
analysis, decision analysis and computational geometry, and
adapts them to the particulars of analyzing learned
classifiers. The method is efficient and incremental,
minimizes the management of classifier performance data, and
allows for clear visual comparisons and sensitivity analyses.
We then show that it is possible to build a hybrid classifier
that will perform at least as well as the best available
classifier for any target conditions. This robust performance
extends across a wide variety of comparison frameworks,
including the optimization of metrics such as accuracy,
expected cost, lift, precision, recall, and workforce
utilization. In some cases, the performance of the hybrid can
actually surpass that of the best known classifier. The hybrid
is also efficient to build, to store, and to update. Finally,
we point to empirical evidence that a robust hybrid classifier
is needed for many real-world problems. |
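The geometric core of the ROC convex hull method can be sketched briefly: take each classifier's (FP rate, TP rate) point, keep only those on the upper convex hull, and then pick the hull point with the lowest expected cost for a given class prior and pair of error costs. This is a minimal sketch under those assumptions, not the authors' implementation; the names `roc_convex_hull` and `best_operating_point` are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)

def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points: Iterable[Point]) -> List[Point]:
    """Upper convex hull of ROC points, including the trivial classifiers."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull: List[Point] = []
    for p in pts:
        # Drop the last hull point if it lies on or below the segment to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def best_operating_point(hull: Iterable[Point], p_pos: float,
                         cost_fn: float, cost_fp: float) -> Point:
    """Hull point minimizing expected cost for the given conditions."""
    def expected_cost(pt: Point) -> float:
        fpr, tpr = pt
        return p_pos * cost_fn * (1.0 - tpr) + (1.0 - p_pos) * cost_fp * fpr
    return min(hull, key=expected_cost)

if __name__ == "__main__":
    classifiers = [(0.1, 0.5), (0.3, 0.6), (0.4, 0.5), (0.5, 0.9)]
    hull = roc_convex_hull(classifiers)
    print(hull)
    print(best_operating_point(hull, p_pos=0.5, cost_fn=3.0, cost_fp=1.0))
    print(best_operating_point(hull, p_pos=0.5, cost_fn=1.0, cost_fp=3.0))
```

Classifiers that fall below the hull are never optimal under any combination of class prior and error costs, so they can be discarded before the target conditions are known. The hull only needs to be recomputed when a new classifier is added.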
|
Postscript: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.ps
(512K)
PDF: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.pdf
(413K)
|
| Activity Monitoring:
Noticing interesting changes in
behavior |
|
Tom Fawcett and Foster Provost
Presented at KDD-99 (Fifth International Conference on Knowledge
Discovery and Data Mining) |
|
Abstract: We introduce a problem
class which we term activity monitoring. Such
problems involve monitoring the behavior of a large population
of entities for interesting events requiring action. We
present a framework within which each of the individual
problems has a natural expression, as well as a methodology
for evaluating performance of activity monitoring techniques.
We show that two superficially different tasks, news story
monitoring and intrusion detection, can be expressed naturally
within the framework, and show that key differences in
solution methods can be compared. |
|
Postscript: KDD99.ps.gz (125K)
An online presentation of this work, which I gave at a Stanford
Symposium on Anomaly Detection, is also available.
|
| The Case Against
Accuracy Estimation for Comparing Induction
Algorithms |
|
Foster Provost, Tom Fawcett and Ron Kohavi
Presented at ICML-98 (Fifteenth International
Conference on Machine Learning), July 1998 |
|
Abstract: We analyze critically
the use of classification accuracy to compare classifiers on
natural data sets, providing a thorough investigation using
ROC analysis, standard machine learning algorithms, and
standard benchmark data sets. The results raise serious
concerns about the use of accuracy for comparing classifiers
and draw into question the conclusions that can be drawn from
such studies. In the course of the presentation, we describe
and demonstrate what we believe to be the proper use of ROC
analysis for comparative studies in machine learning research.
We argue that this methodology is preferable both for making
practical choices and for drawing scientific
conclusions. |
|
Postscript: ICML98-final.ps.gz
|
| Robust
Classification Systems for Imprecise
Environments |
|
Foster Provost and Tom Fawcett
Presented at AAAI-98 (Fifteenth National Conference on Artificial
Intelligence), July 1998 |
|
Abstract: In real-world environments,
it is usually difficult to specify target operating conditions
precisely. This uncertainty makes building robust
classification systems problematic. We show that it is
possible to build a hybrid classifier that will perform at
least as well as the best available classifier for any target
conditions. This robust performance extends across a
wide variety of comparison frameworks, including the
optimization of metrics such as accuracy, expected cost, lift,
precision, recall, and workforce utilization. In some
cases, the performance of the hybrid can actually surpass that
of the best known classifier. The hybrid is also
efficient to build, to store, and to update. Finally, we
provide empirical evidence that a robust hybrid classifier is
needed for many real-world problems. |
|
Postscript: aaai98-dist.ps.gz (81K)
|
Analysis and
Visualization of Classifier Performance:
Comparison under Imprecise Class and Cost
Distributions |
|
Foster Provost and Tom Fawcett
Presented at KDD-97 (Third International Conference on Knowledge
Discovery and Data Mining)
Winner of Best Paper Award (Best Fundamental
Research) |
|
Abstract:
When mining data with inductive methods, we often
experiment with a wide variety of learning algorithms, using
different algorithm parameters, varying output threshold
values, and using different training regimens. Such
experimentation yields a large number of classifiers to be
evaluated and compared. In order to compare the performance
of classifiers it is necessary to know the conditions under
which they will be used; using accuracy alone is inadequate
because class distributions and misclassification costs are
rarely uniform.
Decision-theoretic
principles may be used if the class and cost distributions
are known exactly. Unfortunately, on real-world problems
target cost and class distributions can rarely be specified
precisely, and they are often subject to change. For
example, in fraud detection we cannot ignore either type of
distribution, nor can we assume that our distribution
specifications are static or precise. We need a method for
the management and comparison of multiple classifiers that
is robust to imprecise and changing
environments.
We introduce the
ROC convex hull method, which combines techniques from ROC
analysis, decision analysis and computational geometry. The
method decouples classifier performance from specific class
and cost distributions, and may be used to specify the
subset of methods that are potentially optimal under any
cost and class distribution assumptions.
|
|
Postscript: KDD-97.ps.gz
PDF: KDD-97.pdf
|
| Adaptive Fraud
Detection |
|
Tom Fawcett and Foster Provost
Published in Journal of Data Mining and Knowledge Discovery, v. 1,
n. 3, 1997 |
|
Abstract: One method for detecting
fraud is to check for suspicious changes in user behavior.
This paper describes the automatic design of user profiling
methods for the purpose of fraud detection, using a series of
data mining techniques. Specifically, we use a rule-learning
program to uncover indicators of fraudulent behavior from a
large database of customer transactions. Then the indicators
are used to create a set of monitors, which profile legitimate
customer behavior and indicate anomalies. Finally, the outputs
of the monitors are used as features in a system that learns
to combine evidence to generate high-confidence alarms. The
system has been applied to the problem of detecting cellular
cloning fraud based on a database of call records. Experiments
indicate that this automatic approach outperforms hand-crafted
methods for detecting fraud. Furthermore, this approach can
adapt to the changing conditions typical of fraud detection
environments. |
|
Postscript:
DMKD-97.ps.gz
|
| Combining Data
Mining and Machine Learning for Effective User
Profiling |
|
Tom Fawcett and Foster Provost
Presented at KDD-96 (Second International Conference on
Knowledge Discovery and Data Mining) |
|
Extended
Abstract: In the United States, cellular
fraud costs the telecommunications industry hundreds of
millions of dollars per year. A specific kind of cellular
fraud called cloning is particularly expensive and
epidemic in major cities throughout the United States.
Existing methods for detecting cloning fraud are ad
hoc and their evaluation is virtually nonexistent. We
have embarked on a program of systematic analysis of
cellular call data for the purpose of designing and
evaluating methods for detecting fraudulent
behavior.
This paper presents
a framework for automatically generating fraud detectors.
The framework has several components, and uses data at two
levels of aggregation. Massive numbers of cellular calls are
first analyzed to determine general patterns of fraudulent
usage. These patterns are then used to profile each
individual customer's usage on an account-day basis. The
profiles determine when a customer's behavior has become
uncharacteristic in a way that suggests
fraud.
Our framework
includes a data mining component for discovering indicators
of fraud. A constructive induction component generates
profiling detectors that use the discovered indicators. A
final evidence-combining component determines how to combine
signals from the profiling detectors to generate alarms. The
rest of this paper describes the domain, the framework and
the implemented system, the data, and
results.
|
|
Postscript:
UserProfiling.ps.gz
|
| Knowledge-based
Feature Discovery for Evaluation
Functions |
|
Tom Fawcett
Computational Intelligence 12(1), February
1996. |
|
Abstract:
Since Samuel's work on checkers over thirty years ago, much
effort has been devoted to learning evaluation functions.
However, all such methods are sensitive to the feature set
chosen to represent the examples. If the features do not
capture aspects of the examples significant for problem
solving, the learned evaluation function may be inaccurate or
inconsistent. Typically, good feature sets are carefully
handcrafted and a great deal of time and effort goes into
refining and tuning them. This paper presents an automatic
knowledge-based method for generating features for evaluation
functions. The feature set is developed iteratively: features
are generated, then evaluated, and this information is used to
develop new features in turn. Both the contribution of a
feature and its computational expense are considered in
determining whether and how to develop it further. This method
has been applied to two problem solving domains: the Othello
board game and the domain of telecommunications network
management. Empirical results show that the method is able to
generate many known features and several novel features, and
to improve concept accuracy in both domains. |
|
PDF: kbfd.pdf |
| Modified 20-May-2008 |