Publications
Papers on machine learning and data
mining |
Swarm
Intelligence for Data Mining
|
|
David Martens, Bart
Baesens and Tom Fawcett
Machine Learning, Vol. 82, No. 1, January 2011
From the issue entitled "Special Issue on Swarm Intelligence; Guest Editors: David Martens, Bart Baesens, and Tom Fawcett"
|
|
Abstract: This
paper surveys the intersection of two fascinating and increasingly
popular domains: swarm intelligence and data mining. Whereas data
mining has been a popular academic topic for decades, swarm
intelligence is a relatively new subfield of artificial intelligence
which studies the emergent collective intelligence of groups of simple
agents. It is based on social behavior that can be observed in nature,
such as ant colonies, flocks of birds, fish schools and bee hives,
where a number of individuals with limited capabilities are able to
come to intelligent solutions for complex problems. In recent years the
swarm intelligence paradigm has received widespread attention in
research, mainly as Ant Colony Optimization (ACO) and Particle Swarm
Optimization (PSO). These are also the most popular swarm intelligence
metaheuristics for data mining. In addition to an overview of these
nature inspired computing methodologies, we discuss popular data mining
techniques based on these principles and schematically list the main
differences in our literature tables. Further, we provide a unifying
framework that categorizes the swarm intelligence based data mining
algorithms into two approaches: effective search and data organizing.
Finally, we list interesting issues for future research, hereby
identifying methodological gaps in current research as well as mapping
opportunities provided by swarm intelligence to current challenges
within data mining research.
|
|
Official Springer Link
PDF: fulltext.pdf
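Illustrative sketch (not taken from the paper): a minimal Particle Swarm Optimization loop in Python, to make the "effective search" idea concrete. The function name `pso_minimize`, the parameter values (inertia `w`, attraction weights `c1` and `c2`) and the fixed random seed are assumptions chosen for illustration only.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over a box using a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes inertia, pull toward own best, pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Each particle is a simple agent; the only shared information is the best position found so far, which is what lets the group as a whole home in on good solutions.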
|
Data mining
with cellular automata
|
|
Tom Fawcett
SigKDD Explorations, July 2008, Volume 10,
Issue 1 |
|
Abstract: A
cellular automaton is a discrete, dynamical system composed of very
simple, uniformly interconnected cells. Cellular automata may be seen
as an extreme form of simple, localized, distributed machines. Many
researchers are familiar with cellular automata through Conway's Game
of Life. Researchers have long been interested in the theoretical
aspects of cellular automata. This article explores the use of cellular
automata for data mining, specifically for classification tasks. We
demonstrate that reasonable generalization behavior can be achieved as
an emergent property of these simple automata. |
|
PDF: DMCA-dist.pdf |
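Illustrative sketch (a toy rule, not necessarily one of the rules studied in the article): a two-dimensional cellular automaton in Python in which labeled training points seed a grid and each empty cell adopts the majority class of its four neighbors. The function name `grow_labels`, the encoding (0 for empty, positive integers for classes) and the tie handling are assumptions for illustration.

```python
def grow_labels(grid, max_steps=100):
    """Fill empty cells (0) by majority vote of their four neighbors' classes."""
    rows, cols = len(grid), len(grid[0])
    grid = [row[:] for row in grid]        # work on a copy; the input is untouched
    for _ in range(max_steps):
        new = [row[:] for row in grid]     # synchronous update: read old, write new
        changed = False
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] != 0:
                    continue               # labeled cells keep their class in this toy rule
                votes = {}
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] != 0:
                        votes[grid[rr][cc]] = votes.get(grid[rr][cc], 0) + 1
                if votes:
                    top = max(votes.values())
                    winners = [k for k, v in votes.items() if v == top]
                    if len(winners) == 1:  # ties leave the cell empty for now
                        new[r][c] = winners[0]
                        changed = True
        grid = new
        if not changed:
            break
    return grid
```

After the grid stabilizes, the class of a new instance would be read off the cell it falls into; cells left empty by persistent ties mark the boundary between regions.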
| PRIE: A system for generating rulelists to
maximize ROC performance |
|
Tom Fawcett
Data Mining and Knowledge Discovery, Volume
17, Number 2 / October, 2008. Pages 207 - 224. DOI: 10.1007/s10618-008-0089-y |
|
Abstract:
Rules are commonly used for classification because they are
modular, intelligible and easy to learn. Existing work in
classification rule learning assumes the goal is to produce categorical
classifications to maximize classification accuracy. Recent work in
machine learning has pointed out the limitations of classification
accuracy: when class distributions are skewed, or error costs are
unequal, an accuracy maximizing classifier can perform poorly. This
paper presents a method for learning rules directly from ROC space when
the goal is to maximize the area under the ROC curve (AUC). Basic
principles from rule learning and computational geometry are used to
focus search for promising rule combinations. The result is a system
that can learn intelligible rulelists with good ROC performance. |
|
Journal page: 10.1007/s10618-008-0089-y
Draft copy (PDF): DMKD-UBDM-dist.pdf |
| PAV and the ROC Convex Hull |
|
Tom Fawcett
and Alexandru Niculescu-Mizil
Machine Learning, Volume 68, Issue
1, July 2007, pp. 97-106 |
|
Abstract: Classifier
calibration is the process of converting classifier scores into
reliable probability estimates. Recently, a calibration technique based
on isotonic regression has gained attention within machine learning as
a flexible and effective way to calibrate classifiers. We show that,
surprisingly, isotonic regression based calibration using the Pool
Adjacent Violators algorithm is equivalent to the ROC convex hull
method. |
|
PDF: PAV-ROCCH-dist.pdf |
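Illustrative sketch (not the paper's code): the Pool Adjacent Violators step in Python, applied to 0/1 labels ordered by classifier score. The function name `pav_calibrate` is an assumption, and the sketch assumes all scores are distinct; tied scores would need to be pooled before the pass. Only the PAV side of the equivalence is shown here.

```python
def pav_calibrate(scores, labels):
    """Return (score, calibrated_probability) pairs, sorted by increasing score."""
    pairs = sorted(zip(scores, labels))
    blocks = []                            # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Pool while the previous block's mean exceeds the newest block's mean,
        # since that pair violates the required non-decreasing order.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)         # every instance in a block gets the block mean
    return [(score, p) for (score, _), p in zip(pairs, fitted)]
```

For labels 0, 1, 0, 1, 1, 0 in increasing score order, the pass pools the second and third instances into one block and the last three into another, giving estimates 0, 1/2, 1/2, 2/3, 2/3, 2/3.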
| ROC Graphs with
Instance Varying Costs |
|
Tom
Fawcett
Pattern Recognition Letters (27), No. 8, June 2006, pp. 882-891. |
|
Abstract:
Receiver Operating Characteristics (ROC) graphs
are a useful technique for organizing classifiers and visualizing their
performance. ROC graphs have been used in cost-sensitive learning
because of the ease with which class skew and error cost information
can be applied to them to yield cost-sensitive decisions. However, they
have been criticized because of their inability to handle
instance-varying costs; that is, domains in which error costs vary from
one instance to another. This paper presents and investigates a
technique for adapting ROC graphs for use with domains in which
misclassification costs vary within the instance population. |
|
PDF: ROC-ESC-dist.pdf
|
| A Response to Webb and
Ting's On the
Application of ROC Analysis to Predict Classification Performance Under
Varying Class Distributions |
|
Tom
Fawcett and Peter Flach
Machine Learning, v.58, n.1, pp.33-38, 2005 |
|
Abstract:
In an article in this issue, Webb and Ting criticize ROC
analysis for its inability to handle certain changes in class
distributions. They imply that the ability of ROC graphs to depict
performance in the face of changing class distributions has been
overstated. In this editorial response, we describe two general types
of domains and argue that Webb and Ting's concerns apply primarily to
only one of them. Furthermore, we show that there are interesting
real-world domains of the second type, in which ROC analysis may be
expected to hold in the face of changing class distributions. |
|
PDF: WT-rebuttal.pdf
|
| Two articles in the Machine
Learning special issue on Data Mining Lessons Learned (vol 57, no.
1-2) |
|
-
Editorial: Data Mining Lessons Learned
Nada Lavrac, Hiroshi Motoda and Tom Fawcett
PDF: dmll-editorial-ver5.pdf
-
Introduction:
Lessons Learned from Data Mining Applications and Collaborative Problem
Solving
Nada Lavrac, Hiroshi Motoda, Tom Fawcett, Rob Holte, Pat Langley and
Pieter Adriaans
PDF: dmll-intro-ver7.6.pdf
|
| ROC Graphs: Notes and
Practical Considerations for Researchers |
|
Tom
Fawcett
Previous version published as HP Labs Tech Report HPL-2003-4. This
version has corrections, improvements and some new material. |
|
Abstract:
Receiver Operating Characteristics (ROC) graphs are a
useful technique for organizing classifiers and visualizing their
performance. ROC graphs are commonly used in medical decision making,
and in recent years have been increasingly adopted in the machine
learning and data mining research communities. Although ROC graphs are
apparently simple, there are some common misconceptions and pitfalls
when using them in practice. This article serves both as a tutorial
introduction to ROC graphs and as a practical guide for using them in
research. |
|
Postscript: ROC101.ps.gz
PDF: ROC101.pdf
Note: The code
referenced in this article is available from
this software directory. Unfortunately, the permanent
URL (PURL) link given in the paper doesn't work.
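Illustrative sketch (this is not the code from the software directory mentioned above): one way to generate ROC points from classifier scores and to compute the area under the curve in Python. The function names `roc_points` and `auc` are assumptions, labels are assumed to be 1 for positive and 0 for negative, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return ROC points (fp_rate, tp_rate) by sweeping a threshold down the scores."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev = None
    for s, y in ranked:
        # Emit a point only when the score changes, so that instances with
        # equal scores are treated as a single threshold step.
        if prev is not None and s != prev:
            points.append((fp / neg, tp / pos))
        if y == 1:
            tp += 1
        else:
            fp += 1
        prev = s
    points.append((fp / neg, tp / pos))    # always ends at (1.0, 1.0)
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve, by the trapezoid rule."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area
```

Grouping equal scores matters: emitting a point after every instance would make the curve depend on the arbitrary order of tied instances.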
|
|
"In vivo" spam filtering: A
challenge problem for data mining
|
|
Tom Fawcett
KDD Explorations vol.5 no.2, December 2003.
|
|
Abstract:
Spam, also known as Unsolicited Commercial Email (UCE),
is the bane of email communication. Many data mining researchers have
addressed the problem of detecting spam, generally by treating it as a
static text classification problem. True in vivo spam filtering
has characteristics that make it a rich and challenging domain for data
mining. Indeed, real-world datasets with these characteristics are
typically difficult to acquire and to share. This paper demonstrates
some of these characteristics and argues that researchers should pursue
in vivo spam filtering as an accessible domain for investigating
them. |
|
PDF: spam-KDDexp.pdf
|
| Using Rule Sets to
Maximize ROC Performance |
|
Tom Fawcett
Presented at the 2001 IEEE International Conference on Data Mining
(ICDM-01) |
|
Abstract: Rules
are commonly used for classification because they are modular,
intelligible and easy to learn. Existing work in classification rule
learning assumes the goal is to produce categorical classifications to
maximize classification accuracy. Recent work in machine learning has
pointed out the limitations of classification accuracy; when class
distributions are skewed, or error costs are unequal, an accuracy
maximizing rule set can perform poorly. A more flexible use of a rule
set is to produce instance scores indicating the likelihood that an
instance belongs to a given class. With such an ability, we can apply
rulesets effectively when distributions are skewed or error costs are
unequal. This paper empirically investigates different strategies for
evaluating rule sets when the goal is to maximize the scoring (ROC)
performance. |
|
Postscript: ICDM-final.ps.gz
PDF: ICDM-final.pdf
|
Handbook of Data
Mining and Knowledge Discovery
F2. Fraud Detection
H1.2.1 Case study: Adaptive Fraud Detection |
|
These are chapters that appear in W. Kloesgen and J.
Zytkow (eds.) Handbook
of Data Mining and Knowledge Discovery, Oxford University Press,
2001. Because of copyright issues, these chapters are not available for
downloading. If you'd like to get a copy, send me email.
PLEASE NOTE: These
articles are fairly dated now, and they were not intended as
survey papers. If you're looking for a survey of machine learning or
data mining techniques applied to fraud detection, I would recommend
these papers:
- "A Comprehensive
Survey of Data Mining-based Fraud Detection Research" by Phua, Lee and
Gayler, available from this page.
- "Statistical
Fraud Detection: A Review" by Bolton and Hand.
|
| Robust Classification
for Imprecise Environments |
|
Foster
Provost and Tom Fawcett
Machine Learning Journal, vol. 42, no. 3. March
2001. pp. 203-231 |
|
Abstract: In
real-world environments it is usually difficult to specify target
operating conditions precisely. This uncertainty makes building robust
classification systems problematic. We present a method for the
comparison of classifier performance that is robust to imprecise class
distributions and misclassification costs. The ROC convex hull method
combines techniques from ROC analysis, decision analysis and
computational geometry, and adapts them to the particulars of analyzing
learned classifiers. The method is efficient and incremental, minimizes
the management of classifier performance data, and allows for clear
visual comparisons and sensitivity analyses. We then show that it is
possible to build a hybrid classifier that will perform at least as
well as the best available classifier for any target conditions. This
robust performance extends across a wide variety of comparison
frameworks, including the optimization of metrics such as accuracy,
expected cost, lift, precision, recall, and workforce utilization. In
some cases, the performance of the hybrid can actually surpass that of
the best known classifier. The hybrid is also efficient to build, to
store, and to update. Finally, we point to empirical evidence that a
robust hybrid classifier is needed for many real-world problems. |
|
Postscript: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.ps
(512K)
PDF: http://www.stern.nyu.edu/~fprovost/Papers/rocch-mlj.pdf
(413K)
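Illustrative sketch (not the paper's implementation): computing the ROC convex hull of a set of classifiers in Python, where each classifier is a point (false positive rate, true positive rate). The function names `cross` and `roc_convex_hull` are assumptions; the hull is built with a standard monotone-chain pass and always includes the trivial classifiers (0, 0) and (1, 1).

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of ROC points, from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # from the point before it to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```

Classifiers that do not appear on the returned hull lie on or below it, so under the abstract's argument they cannot be the best choice for any class distribution or cost assignment.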
|
| Activity Monitoring:
Noticing interesting changes in behavior |
|
Tom Fawcett and Foster Provost
Presented at KDD-99 (Fifth International Conference on Knowledge Discovery and Data Mining) |
|
Abstract:
We introduce a problem class which we term activity
monitoring. Such problems involve monitoring the behavior of a
large population of entities for interesting events requiring action.
We present a framework within which each of the individual problems has
a natural expression, as well as a methodology for evaluating
performance of activity monitoring techniques. We show that two
superficially different tasks, news story monitoring and intrusion
detection, can be expressed naturally within the framework, and show
that key differences in solution methods can be compared. |
|
Postscript: KDD99.ps.gz (125K)
Here is an online presentation
of this work that I presented at a Stanford Symposium on Anomaly
Detection.
|
| The Case Against
Accuracy Estimation for Comparing Induction Algorithms |
|
Foster Provost, Tom Fawcett and Ron Kohavi
Presented at ICML-98 (Fifteenth International Conference on Machine Learning), July 1998 |
|
Abstract: We
analyze critically the use of classification accuracy to compare
classifiers on natural data sets, providing a thorough investigation
using ROC analysis, standard machine learning algorithms, and standard
benchmark data sets. The results raise serious concerns about the use
of accuracy for comparing classifiers and draw into question the
conclusions that can be drawn from such studies. In the course of the
presentation, we describe and demonstrate what we believe to be the
proper use of ROC analysis for comparative studies in machine learning
research. We argue that this methodology is preferable both for making
practical choices and for drawing scientific conclusions. |
|
Postscript: ICML98-final.ps.gz
|
| Robust Classification
Systems for Imprecise Environments |
|
Foster Provost and Tom Fawcett
Presented at AAAI-98 (Fifteenth National Conference on Artificial Intelligence), July 1998 |
|
Abstract:
In real-world environments, it is usually difficult to
specify target operating conditions precisely. This uncertainty
makes building robust classification systems problematic. We show
that it is possible to build a hybrid classifier that will perform at
least as well as the best available classifier for any target
conditions. This robust performance extends across a wide variety
of comparison frameworks, including the optimization of metrics such as
accuracy, expected cost, lift, precision, recall, and workforce
utilization. In some cases, the performance of the hybrid can
actually surpass that of the best known classifier. The hybrid is
also efficient to build, to store, and to update. Finally, we
provide empirical evidence that a robust hybrid classifier is needed
for many real-world problems. |
|
Postscript: aaai98-dist.ps.gz (81K)
|
Analysis and
Visualization of Classifier Performance:
Comparison under Imprecise Class and Cost Distributions |
|
Foster Provost and Tom Fawcett
Presented at KDD-97 (Third International Conference on Knowledge Discovery and Data Mining)
Winner of Best Paper Award (Best Fundamental Research) |
|
Abstract: When mining data with
inductive methods, we often experiment with a wide variety of learning
algorithms, using different algorithm parameters, varying output
threshold values, and using different training regimens. Such
experimentation yields a large number of classifiers to be evaluated
and compared. In order to compare the performance of classifiers it is
necessary to know the conditions under which they will be used; using
accuracy alone is inadequate because class distributions and
misclassification costs are rarely uniform.
Decision-theoretic
principles may be used if the class and cost distributions are known
exactly. Unfortunately, on real-world problems target cost and class
distributions can rarely be specified precisely, and they are often
subject to change. For example, in fraud detection we cannot ignore
either type of distribution, nor can we assume that our distribution
specifications are static or precise. We need a method for the
management and comparison of multiple classifiers that is robust to
imprecise and changing environments.
We introduce the
ROC convex hull method, which combines techniques from ROC analysis,
decision analysis and computational geometry. The method decouples
classifier performance from specific class and cost distributions, and
may be used to specify the subset of methods that are potentially
optimal under any cost and class distribution assumptions.
|
|
Postscript: KDD-97.ps.gz
PDF: KDD-97.pdf
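Illustrative sketch (not from the paper): once a class distribution and error costs are fixed, the best operating point among a set of ROC points can be picked by minimizing expected cost, shown here in Python. The function names `expected_cost` and `best_operating_point`, and the parameters `p_pos`, `cost_fp` and `cost_fn`, are assumptions for illustration; each point is a (false positive rate, true positive rate) pair.

```python
def expected_cost(point, p_pos, cost_fp, cost_fn):
    """Expected cost per instance at an ROC point (fp_rate, tp_rate)."""
    fpr, tpr = point
    # Negatives are misclassified at rate fpr, positives at rate (1 - tpr).
    return (1 - p_pos) * fpr * cost_fp + p_pos * (1 - tpr) * cost_fn


def best_operating_point(points, p_pos, cost_fp, cost_fn):
    """Return the ROC point with the lowest expected cost under the given conditions."""
    return min(points, key=lambda pt: expected_cost(pt, p_pos, cost_fp, cost_fn))
```

With the same candidate points, changing `p_pos` or the cost ratio can move the optimum from one point to another, which is why the abstract argues that accuracy alone is inadequate when these conditions are imprecise.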
|
| Adaptive Fraud
Detection |
|
Tom Fawcett and Foster Provost
Published in Journal of Data Mining and Knowledge Discovery, v.1 n.3, 1997 |
|
Abstract:
One method for detecting fraud is to check for
suspicious changes in user behavior. This paper describes the automatic
design of user profiling methods for the purpose of fraud detection,
using a series of data mining techniques. Specifically, we use a
rule-learning program to uncover indicators of fraudulent behavior from
a large database of customer transactions. Then the indicators are used
to create a set of monitors, which profile legitimate customer behavior
and indicate anomalies. Finally, the outputs of the monitors are used
as features in a system that learns to combine evidence to generate
high-confidence alarms. The system has been applied to the problem of
detecting cellular cloning fraud based on a database of call records.
Experiments indicate that this automatic approach outperforms
hand-crafted methods for detecting fraud. Furthermore, this approach
can adapt to the changing conditions typical of fraud detection
environments. |
|
Postscript: DMKD-97.ps.gz
|
| Combining Data Mining
and Machine Learning for Effective User Profiling |
|
Tom Fawcett and Foster Provost
Presented at KDD-96 (Second International Conference on Knowledge Discovery and Data Mining) |
|
Extended Abstract: In the United
States, cellular fraud costs the telecommunications industry hundreds
of millions of dollars per year. A specific kind of cellular fraud
called cloning is particularly expensive and epidemic in
major cities throughout the United States. Existing methods for
detecting cloning fraud are ad hoc and their evaluation is
virtually nonexistent. We have embarked on a program of systematic
analysis of cellular call data for the purpose of designing and
evaluating methods for detecting fraudulent behavior.
This paper
presents a framework for automatically generating fraud detectors. The
framework has several components, and uses data at two levels of
aggregation. Massive numbers of cellular calls are first analyzed to
determine general patterns of fraudulent usage. These patterns are then
used to profile each individual customer's usage on an account-day
basis. The profiles determine when a customer's behavior has become
uncharacteristic in a way that suggests fraud.
Our framework
includes a data mining component for discovering indicators of fraud. A
constructive induction component generates profiling detectors that use
the discovered indicators. A final evidence-combining component
determines how to combine signals from the profiling detectors to
generate alarms. The rest of this paper describes the domain, the
framework and the implemented system, the data, and results.
|
|
Postscript: UserProfiling.ps.gz
|
| Knowledge-based
Feature Discovery for Evaluation Functions |
|
Tom Fawcett
Computational Intelligence 12(1), February 1996. |
|
Abstract: Since
Samuel's work on checkers over thirty years ago, much effort has been
devoted to learning evaluation functions. However, all such methods are
sensitive to the feature set chosen to represent the examples. If the
features do not capture aspects of the examples significant for problem
solving, the learned evaluation function may be inaccurate or
inconsistent. Typically, good feature sets are carefully handcrafted
and a great deal of time and effort goes into refining and tuning them.
This paper presents an automatic knowledge-based method for generating
features for evaluation functions. The feature set is developed
iteratively: features are generated, then evaluated, and this
information is used to develop new features in turn. Both the
contribution of a feature and its computational expense are considered
in determining whether and how to develop it further. This method has
been applied to two problem solving domains: the Othello board game and
the domain of telecommunications network management. Empirical results
show that the method is able to generate many known features and
several novel features, and to improve concept accuracy in both domains. |
|
PDF: kbfd.pdf |
| Modified
29-Dec-2010 |