PAKDD 2011 Tutorial: Data Stream Mining: Challenges and Techniques
The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining
24-27 May 2011 (Tue-Fri) - Shenzhen, China
Latifur Khan1, Wei Fan2, Jiawei Han3, Jing Gao3, Mohammad M. Masud1
1University of
2IBM T.J. Watson Research, weifan@us.ibm.com
3University of
Abstract
Data streams are continuous flows of data. Examples include network traffic,
sensor data, and call center records. Their sheer volume and speed pose a great
challenge for the data mining community. Data streams exhibit several unique
properties: infinite length, concept-drift, concept-evolution,
feature-evolution and limited labeled data. Concept-drift occurs when the
underlying concept of the data changes over time. Concept-evolution occurs when
new classes emerge in the stream. Feature-evolution occurs when the feature set
varies with time. Data streams also suffer from a scarcity of labeled data,
since it is not possible to manually label all the data points in the stream.
Each of these properties adds a challenge to data stream mining. This tutorial
presents an organized picture of how to apply various data mining techniques to
data streams: in particular, how to handle classification and clustering in
evolving data streams by addressing these challenges.
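As a rough illustration of concept-drift (this sketch is not taken from the
tutorial material), the Python example below compares a model that is trained
once and then frozen with a model that is refit on a sliding window of recent
labeled points. The toy stream, the boundary values 0.3 and 0.7, and the names
fit_threshold, make_stream and run are all assumptions made for this example.

```python
from collections import deque


def fit_threshold(samples):
    """Fit a 1-D threshold: midpoint between the largest class-0 value
    and the smallest class-1 value (a crude learner for this toy data)."""
    neg = [x for x, y in samples if y == 0]
    pos = [x for x, y in samples if y == 1]
    if not neg or not pos:
        return None
    return (max(neg) + min(pos)) / 2.0


def make_stream(n, drift_at):
    """Deterministic toy stream (hypothetical data): the true class
    boundary moves from 0.3 to 0.7 at position drift_at."""
    for i in range(n):
        x = (i * 37 % 100) / 100.0
        boundary = 0.3 if i < drift_at else 0.7
        yield x, int(x > boundary)


def run(n=2000, drift_at=1000, window=200):
    recent = deque(maxlen=window)  # bounded memory for an unbounded stream
    static_t = sliding_t = None
    static_hits = sliding_hits = tested = 0
    for i, (x, y) in enumerate(make_stream(n, drift_at)):
        # Test-then-train: score only once the window holds post-drift data.
        if i >= drift_at + window:
            tested += 1
            static_hits += int(int(x > static_t) == y)
            sliding_hits += int(int(x > sliding_t) == y)
        recent.append((x, y))
        if len(recent) == window:
            sliding_t = fit_threshold(recent)
            if static_t is None:
                static_t = sliding_t  # frozen after the first full window
    return static_hits / tested, sliding_hits / tested


static_acc, sliding_acc = run()
print("frozen model:", static_acc, "sliding-window model:", sliding_acc)
```

On this toy stream the frozen model keeps using the old boundary and loses
accuracy after the drift, while the sliding-window model recovers once its
window contains only post-drift data.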
About the Presenters:
· Latifur R. Khan is currently an Associate Professor in the Computer
Science department at the
· Wei Fan received his PhD in Computer Science from Columbia University in 2001
and has been working at IBM T.J. Watson Research since 2000.
He has published more than 60 papers in top data mining, machine learning and
database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE,
AAAI, ICML etc. Dr. Fan has served as Area Chair, Senior PC of SIGKDD'06,
SDM'08 and ICDM'08/09, sponsorship co-chair of SDM'09, award committee member
of ICDM'09, as well as PC of several prestigious conferences in the area
including KDD'09/08/07/05, ICDM'07/06/05/04/03, SDM'09/07/06/05/04,
CIKM'09/08/07/06, ECML/PKDD'07/06, ICDE'04, AAAI'07, PAKDD'09/08/07, EDBT'04,
WWW'09/08/07, etc. He is on the advisory board of KD2U. Dr. Fan was invited to
speak at ICMLA'06. He served as a US NSF panelist in 2007/08. His main research
interests and experience are in various areas of data mining and database
systems, such as risk analysis, high performance computing, extremely skewed
distribution, cost-sensitive learning, data streams, ensemble methods,
easy-to-use nonparametric methods, graph mining, predictive feature discovery,
feature selection, sample selection bias, transfer learning, novel applications
and commercial data mining systems. He is particularly interested in simple,
unconventional, but effective methods to solve difficult problems. His thesis
work on intrusion detection has been licensed by a start-up company since 2001.
His team's submission, which uses Random Decision Tree, won the ICDM'08
Contest Crown Awards. His co-authored paper in ICDM'06 that uses
"Randomized Decision Tree" to predict skewed ozone days won the best
application paper award. His co-authored paper in KDD'97 on the distributed
learning system "JAM" won the runner-up best application paper award.
· Jiawei Han is a professor in the Department of Computer Science,
· Jing Gao received the BEng and MEng degrees, both in Computer Science, from
Harbin Institute of Technology, China, in 2002 and 2004, respectively. She is
currently working toward the Ph.D. degree in the Department of Computer Science,
· Mohammad Mehedy Masud is a Post Doctoral Research Associate at the
A preliminary outline of the tutorial
1. Introduction: characteristics of data streams and challenges in stream mining
   a) Infinite length: Data streams are theoretically infinite, so efficient
      storage and incremental learning are required
   b) Concept-drift: The underlying concept changes over time, so the learner
      must adapt to this change
   c) Concept-evolution: New classes emerge in the stream, which makes
      classification difficult
   d) Feature-evolution: New features may also emerge in the stream, as in
      text streams
   e) Limited labeled data: Most of the data points in the stream remain
      unlabeled, which is a major challenge for supervised learning techniques
2. Data stream classification:
   a) Single-model incremental classification: A single model is updated
      incrementally to cope with concept-drift
   b) Ensemble-model based classification: An ensemble is maintained rather
      than a single model
      i. Supervised: The models are trained in a supervised fashion
      ii. Semi-supervised: The models are trained with a semi-supervised
          technique using both labeled and unlabeled data
      iii. Active learning: Data points are chosen selectively for labeling,
           and the labeled data are used for training
3. Clustering in evolving streams:
   a) Incremental and unsupervised/supervised clustering
   b) Incremental density-based clustering
4. Novel Class Detection in streams: How new classes arriving in the stream can
   be automatically detected without any prior knowledge about those classes
   a) Single novel class: only one novel class arrives in the stream
   b) Multiple novel classes: more than one novel class arrives
5. Novel Class Detection in feature-evolving streams: How new classes can be
   automatically detected when the feature set evolves in the stream
   a) Evolving feature set and classification
   b) Feature space conversion
6. Applications in various domains including security:
   a) Malware detection
   b) Novel topic detection in text streams
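
To make the ensemble-based classification idea in item 2(b) more concrete, here
is a minimal Python sketch, written for this page and not taken from the
tutorial. It assumes the stream arrives in fully labeled chunks, trains one
simple threshold model per chunk, keeps the models that score best on the
newest chunk, and predicts by majority vote. The names train_stump,
update_ensemble, ensemble_predict, make_chunk and run, the chunk size, the
ensemble size of 3, and the toy data are all hypothetical.

```python
def train_stump(chunk):
    """Train a 1-D threshold model on a labeled chunk of (x, y) pairs."""
    neg = [x for x, y in chunk if y == 0]
    pos = [x for x, y in chunk if y == 1]
    if not neg or not pos:
        return None
    return (max(neg) + min(pos)) / 2.0


def accuracy(threshold, chunk):
    """Fraction of the chunk that a single threshold classifies correctly."""
    return sum(int(x > threshold) == y for x, y in chunk) / len(chunk)


def ensemble_predict(ensemble, x):
    """Unweighted majority vote over the threshold models."""
    votes = sum(1 for t in ensemble if x > t)
    return int(2 * votes > len(ensemble))


def update_ensemble(ensemble, chunk, max_models=3):
    """Add a model trained on the newest chunk, then keep the max_models
    models that are most accurate on that chunk."""
    candidate = train_stump(chunk)
    pool = ensemble + ([candidate] if candidate is not None else [])
    pool.sort(key=lambda t: accuracy(t, chunk), reverse=True)
    return pool[:max_models]


def make_chunk(index, size=100):
    """Toy data (assumption): the class boundary drifts at chunk 5."""
    boundary = 0.3 if index < 5 else 0.7
    xs = [((index * size + j) * 37 % 100) / 100.0 for j in range(size)]
    return [(x, int(x > boundary)) for x in xs]


def run(num_chunks=10):
    ensemble, scores = [], []
    for index in range(num_chunks):
        chunk = make_chunk(index)
        if ensemble:  # test on the new chunk before training on it
            hits = sum(ensemble_predict(ensemble, x) == y for x, y in chunk)
            scores.append(hits / len(chunk))
        ensemble = update_ensemble(ensemble, chunk)
    return scores


print(run())
```

In this sketch the accuracy drops on the chunks right after the drift and
recovers once models trained on the new concept form the majority. An
accuracy-weighted vote would be one way to shorten that recovery period.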
Duration
The duration of the tutorial will be 3 hours.