PAKDD 2011 Tutorial: Data Stream Mining: Challenges and Techniques

The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining
24-27 May 2011 (Tue-Fri) - Shenzhen, China

Latifur Khan1      Wei Fan2        Jiawei Han3      Jing Gao3     Mohammad M. Masud1

1University of Texas at Dallas, {lkhan, Mehedy}@utdallas.edu

2IBM T.J.Watson Research, weifan@us.ibm.com 

3University of Illinois at Urbana Champaign, {hanj,jinggao3}@uiuc.edu

Abstract

Data streams are continuous flows of data. Examples include network traffic, sensor data, and call center records. Their sheer volume and speed make them challenging to mine. Data streams have several distinctive properties: infinite length, concept-drift, concept-evolution, feature-evolution, and limited labeled data. Concept-drift occurs when the underlying concept of the data changes over time. Concept-evolution occurs when new classes emerge in the stream. Feature-evolution occurs when the feature set varies over time. Data streams also suffer from a scarcity of labeled data, since it is not possible to manually label every data point in the stream. Each of these properties adds a challenge to data stream mining. This tutorial presents an organized picture of how to apply data mining techniques to data streams, in particular how to handle classification and clustering in evolving data streams while addressing these challenges.

 

About the Presenters:

·      Latifur R. Khan is currently an Associate Professor in the Computer Science department at the University of Texas at Dallas (UTD), where he has taught and conducted research since September 2000. He received his Ph.D. and M.S. degrees in Computer Science from the University of Southern California (USC), USA in August 2000 and December 1996, respectively. His research work is supported by grants from NASA, the Air Force Office of Scientific Research (AFOSR), National Science Foundation (NSF), the Nokia Research Center, CISCO, Texas Instruments, and Raytheon. Dr. Khan's research areas cover data mining, multimedia information management, and semantic web. He has published more than 150 papers in data mining and database conferences, such as ICDM, ECML/PKDD, PAKDD, AAAI, ACM Multimedia, and journals such as VLDB, TKDE, TDSC, Web Semantics, Bio Informatics, KAIS, etc. Dr. Khan has served as a PC member of several conferences such as KDD, ICDM, SDM, and PAKDD. Dr. Khan has also chaired and co-chaired several international conferences and workshops including IEEE Intelligence and Security Informatics (ISI) 2009 Conference, International Workshop on Cloud Privacy, Security, Risk & Trust (CPSRT 2010), in conjunction with IEEE CloudCom 2010, USA and ACM 6th International Workshop on Multimedia Data Mining (MDM/KDD2005). He has been invited to conduct tutorial sessions at prominent conferences such as ACM WWW 2005, MIS2005, DASFAA 2007, and WI 2008 ("Matching Words and Pictures - Problems, Applications, and Progress").

 

·      Wei Fan received his PhD in Computer Science from Columbia University in 2001 and has been working at IBM T.J. Watson Research since 2000. He has published more than 60 papers in top data mining, machine learning and database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, ICML, etc. Dr. Fan has served as Area Chair and Senior PC member of SIGKDD'06, SDM'08 and ICDM'08/09, sponsorship co-chair of SDM'09, award committee member of ICDM'09, as well as PC member of several prestigious conferences in the area including KDD'09/08/07/05, ICDM'07/06/05/04/03, SDM'09/07/06/05/04, CIKM'09/08/07/06, ECML/PKDD'07/06, ICDE'04, AAAI'07, PAKDD'09/08/07, EDBT'04, WWW'09/08/07, etc. He is on the advisory board of KD2U. Dr. Fan was invited to speak at ICMLA'06. He served as a US NSF panelist in 2007/08. His main research interests and experiences are in various areas of data mining and database systems, such as risk analysis, high performance computing, extremely skewed distributions, cost-sensitive learning, data streams, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, novel applications and commercial data mining systems. He is particularly interested in simple, unconventional, but effective methods to solve difficult problems. His thesis work on intrusion detection has been licensed by a start-up company since 2001. His co-teamed submission that uses Random Decision Tree won the ICDM'08 Contest Crown Awards. His co-authored paper in ICDM'06 that uses "Randomized Decision Tree" to predict skewed ozone days won the best application paper award. His co-authored paper in KDD'97 on the distributed learning system "JAM" won the runner-up best application paper award.

 

·      Jiawei Han is a professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, stream data mining, spatial and multimedia data mining, and bio-medical data mining, with over 300 conference and journal publications. He has chaired or served on over 100 program committees of international conferences and workshops, including ACM SIGKDD Conferences (2001 best paper award chair, 2002 student award chair, 1996 PC co-chair), SIAM-Data Mining Conferences (2001 and 2002 PC co-chair), ACM SIGMOD Conferences (2000 exhibit program chair), International Conferences on Data Engineering (2004 and 2002 PC vice-chair), International Conferences on Data Mining (2005 PC co-chair) and International Conference on Very Large Data Bases (2006 VLDB Americas Chair). He also served or is serving as EIC of ACM Transactions on Knowledge Discovery from Data and on the editorial boards for Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Intelligent Information Systems, and Journal of Computer Science and Technology. Jiawei has received the Outstanding Contribution Award at the 2002 International Conference on Data Mining, ACM Service Award (1999), ACM SIGKDD Innovations Award (2004), and IEEE CS Technical Achievement Award (2005). He is an ACM and IEEE Fellow. He is the first author of the textbook "Data Mining: Concepts and Techniques", 2nd ed. (Morgan Kaufmann, 2006).

 

·      Jing Gao received the BEng and MEng degrees, both in Computer Science from Harbin Institute of Technology, China, in 2002 and 2004, respectively. She is currently working toward the Ph.D. degree in the Department of Computer Science, University of Illinois at Urbana Champaign. She is broadly interested in data and information analysis with a focus on data mining and machine learning. In particular, her research interests include ensemble methods, transfer learning, mining data streams and anomaly detection. She has published more than 20 papers in refereed journals and conferences, including KDD, NIPS, ICDCS, ICDM and SDM conferences.

 

·      Mohammad Mehedy Masud is a Post Doctoral Research Associate at the University of Texas at Dallas (UTD). He received his Ph.D. degree in Computer Science (CS) from UTD in December 2009. He received his M.S. and B.S. degrees in Computer Science and Engineering from Bangladesh University of Engineering and Technology. His research interests are in data stream mining, machine learning, and intrusion detection using data mining. His recent research focuses on developing data mining techniques to classify data streams. He has published more than 20 research papers in journals including IEEE TKDE, and conferences including ICDM, ECML/PKDD and PAKDD.

 

 

A preliminary outline of the tutorial

1.   Introduction:  characteristics of data streams and challenges in stream mining

 

a)      Infinite length: In theory, data streams are infinite, so efficient storage and incremental learning are required

b)      Concept-drift: The underlying concept changes over time, so the learner should adapt to this change

c)      Concept-evolution: New classes evolve in the stream, which makes classification difficult

d)      Feature-evolution: New features may also emerge in the stream, for example in text streams

e)      Limited labeled data: Most of the data points in the stream remain unlabeled, which is a major challenge for supervised learning techniques
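
To make the infinite-length constraint in item a) concrete, the following is a minimal Python sketch of incremental computation in constant memory. It uses Welford's online algorithm to maintain a running mean and variance one data point at a time. The class name `RunningStats` is hypothetical, and the example is an illustration only, not material taken from the tutorial.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory.

    Hypothetical helper for illustration; no past data points are stored.
    """

    def __init__(self) -> None:
        self.n = 0          # number of points seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new stream element into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the points seen so far (0.0 if none)."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:   # stands in for an unbounded stream
    stats.update(value)
```

The summary occupies the same three numbers no matter how many points have arrived, which is the property an incremental learner needs.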

 

2.   Data stream classification:

 

a)      Single model incremental classification: Strives to cope with the concept-drift

b)      Ensemble-model based classification: An ensemble is maintained rather than a single model

        i.      Supervised: The models are trained in a supervised fashion

        ii.     Semi-supervised: The models are trained with a semi-supervised technique using both labeled and unlabeled data

        iii.    Active learning: Data are chosen selectively for labeling, and those labeled data are used for training
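
The ensemble idea in item b) can be sketched as follows, under several assumptions that are not stated in the outline: the stream is processed in labeled chunks, one model is trained per chunk, the ensemble keeps the models that are most accurate on the newest chunk, and predictions are made by majority vote. The base learner (`CentroidModel`, a nearest-centroid classifier) and all other names are hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter


class CentroidModel:
    """Toy base learner (an assumption): predicts the nearest class centroid."""

    def __init__(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def accuracy(model, X, y):
    """Fraction of points in the chunk that the model labels correctly."""
    return sum(model.predict(xi) == yi for xi, yi in zip(X, y)) / len(y)


class ChunkEnsemble:
    """Fixed-size ensemble refreshed with every labeled chunk."""

    def __init__(self, max_models=3):
        self.max_models = max_models
        self.models = []

    def update(self, X, y):
        # Train a model on the newest chunk, then keep the max_models
        # candidates that are most accurate on that chunk. Models that
        # reflect an outdated concept score poorly and are dropped.
        candidates = self.models + [CentroidModel(X, y)]
        candidates.sort(key=lambda m: accuracy(m, X, y), reverse=True)
        self.models = candidates[: self.max_models]

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]


ensemble = ChunkEnsemble(max_models=2)
old_X, old_y = [(0, 0), (1, 0), (5, 5), (6, 5)], ["a", "a", "b", "b"]
ensemble.update(old_X, old_y)
before_drift = ensemble.predict((0.1, 0.2))

# Simulated concept-drift: the two classes swap regions.
new_y = ["b", "b", "a", "a"]
ensemble.update(old_X, new_y)
ensemble.update(old_X, new_y)
after_drift = ensemble.predict((0.1, 0.2))
```

In this toy run the model trained before the drift is evicted after two drifted chunks, so the ensemble's vote follows the new concept. The semi-supervised and active learning variants would change how each chunk's training labels are obtained, not this update loop.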

 

3.   Clustering in evolving streams:

a)      Incremental and unsupervised/supervised clustering

b)      Incremental density-based clustering
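
As a minimal illustration of incremental, unsupervised clustering (item a), the sketch below keeps a small summary per cluster (point count and per-dimension linear sum) and updates it one point at a time. A point joins the nearest cluster if it lies within a fixed radius of that cluster's centroid; otherwise it starts a new cluster. The names `MicroCluster` and `cluster_stream` and the fixed-radius rule are assumptions made for this example, and the sketch is not density-based.

```python
import math


class MicroCluster:
    """Summary of a cluster: count and linear sum, never the raw points."""

    def __init__(self, x):
        self.n = 1
        self.ls = list(x)   # per-dimension linear sum of absorbed points

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, x):
        """Add one point to the summary in O(d) time."""
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v


def cluster_stream(points, radius):
    """Single pass over the stream; returns the list of micro-clusters."""
    clusters = []
    for x in points:
        nearest = min(
            clusters,
            key=lambda c: math.dist(x, c.centroid()),
            default=None,
        )
        if nearest is not None and math.dist(x, nearest.centroid()) <= radius:
            nearest.absorb(x)
        else:
            clusters.append(MicroCluster(x))
    return clusters


stream = [(0, 0), (0.2, 0), (10, 10), (10.2, 10), (0.1, 0.1)]
clusters = cluster_stream(stream, radius=1.0)
sizes = [c.n for c in clusters]
```

Because only the summaries are retained, memory grows with the number of clusters rather than with the length of the stream.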

 

4.   Novel Class Detection in streams: How new classes arriving in the stream can be automatically detected without any prior knowledge about those classes

a)      Single novel class: if only one novel class arrives in the stream

b)      Multiple novel classes: if more than one novel class arrives
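
One simple way to picture novel class detection is sketched below. It is a toy heuristic under stated assumptions, not the detection technique presented in the tutorial: each known class is summarized by a centroid, a point is an outlier if it lies farther than `radius` from every known centroid, and a novel class is declared only when at least `min_points` outliers are also close to one another. The function names and both parameters are hypothetical.

```python
import math


def find_outliers(points, centroids, radius):
    """Points that fall outside the radius of every known class centroid."""
    return [
        x for x in points
        if all(math.dist(x, c) > radius for c in centroids.values())
    ]


def detect_novel_class(points, centroids, radius, min_points):
    """True if enough outliers form one cohesive group.

    Cohesion is checked against the outliers' own mean, so scattered
    noise points do not trigger a detection.
    """
    outliers = find_outliers(points, centroids, radius)
    if len(outliers) < min_points:
        return False
    mean = [sum(col) / len(outliers) for col in zip(*outliers)]
    cohesive = [x for x in outliers if math.dist(x, mean) <= radius]
    return len(cohesive) >= min_points


known = {"a": (0.0, 0.0), "b": (5.0, 5.0)}
grouped = [(0.1, 0.1), (10.0, 0.0), (10.2, 0.1), (9.9, -0.1)]
scattered = [(10.0, 0.0), (-10.0, 3.0), (0.0, 20.0)]

novel_in_grouped = detect_novel_class(grouped, known, radius=1.0, min_points=3)
novel_in_scattered = detect_novel_class(scattered, known, radius=1.0, min_points=3)
```

This sketch covers only the single novel class case in item a); separating several simultaneous novel classes would require grouping the outliers further.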

 

5.   Novel Class Detection in feature-evolving streams: How new classes can be automatically detected when the feature set evolves in the stream

a)      Evolving feature set and classification

b)      Feature space conversion
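
The outline does not say how feature space conversion is performed. As one possible reading, the sketch below maps two sparse feature vectors with different feature sets onto the union of their features, filling absent features with zero, so that they can be compared in a common space. The function name `align_to_union` and the zero-fill choice are assumptions for illustration.

```python
def align_to_union(a: dict, b: dict):
    """Project two sparse feature dicts onto the union of their features.

    Returns (feature_names, vector_a, vector_b). Features absent from
    one side are filled with 0.0 (an assumption of this sketch).
    """
    features = sorted(set(a) | set(b))
    vec_a = [float(a.get(f, 0.0)) for f in features]
    vec_b = [float(b.get(f, 0.0)) for f in features]
    return features, vec_a, vec_b


# Word counts from an earlier and a later part of a text stream.
earlier = {"free": 2, "offer": 1}
later = {"offer": 1, "meeting": 3}
features, vec_earlier, vec_later = align_to_union(earlier, later)
```

After alignment both vectors have the same length and ordering, so distance-based classification or novel class detection can be applied to them directly.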

 

6.   Applications in various domains including security:

a)      Malware detection

b)      Novel topic detection in text streams

 

 

 

Duration

 

The duration of the tutorial will be 3 hours.