CS 6v81 -- Parallel Data Mining for Very Large Datasets

This is an independent study course on parallel data mining. Students will study parallel data mining algorithms for clustering, classification, event sequence mining, and probabilistic finite state automata (PFSA) mining, and will then apply these algorithms to several large datasets.

Main Readings

  1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, March 2006.
  2. A V Raman and J D Patrick, The sk-strings method for inferring PFSA, International Conference on Machine Learning, 1997.
  3. P. Dupont, F. Denis, and Y. Esposito, Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms, Pattern Recognition, Volume 38, Issue 9, September 2005, Pages 1349-1371.
  4. Daniel Mercer, Clustering large datasets, Report, 2003, http://www.stats.ox.ac.uk/~mercer/documents/Transfer.pdf
  5. Ron Bekkerman and Martin Scholz, Data weaving: scaling up the state-of-the-art in data clustering, ACM Conference on Information and Knowledge Management, California, 2008, Pages 1083-1092.

Other Readings

  1. Distributed Data Mining Bibliography, http://www.csee.umbc.edu/~hillol/DDMBIB/
  2. Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, ACM SIGMOD International Conference on Management of Data, Canada, June 1996.
  3. Mahesh V. Joshi, Eui-Hong (Sam) Han, George Karypis, Vipin Kumar, Parallel Algorithms in Data Mining, CRPC Parallel Computing Handbook, 2000.
  4. M.J. Zaki, Ching-Tien Ho (Eds.), Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Vol. 1759, Springer, 2000.
  5. Jianwei Li, Ying Liu, Wei-keng Liao, Alok Choudhary, Parallel Data Mining Algorithms for Association Rules and Clustering, CRC Press, LLC, 2006.
  6. J. Pisharath, J. Zambreno, B. Ozisikyilmaz, and A. Choudhary, Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design, International Workshop on High Performance Data Mining, April 2006.

Anomaly Detection

  1. Varun Chandola, Arindam Banerjee, Vipin Kumar, Anomaly detection: A survey, ACM Computing Surveys, Volume 41, Issue 3, July 2009.
  2. Elio Lozano and Edgar Acuna, Parallel algorithms for distance-based and density-based outliers, International Conference on Data Mining, Houston, Texas, August 2005.
  3. E. Hung and D. Cheung, Parallel Mining of Outliers in Large Database, Distributed and Parallel Databases, 12:5-26, July 2002.

Projects

The experimental platforms will include:

Links