Data Mining:
Foundation, Techniques and Applications
|
Date |
Session |
Topics |
4th Dec 2007 |
Morning |
|
Afternoon |
||
Night |
||
5th Dec 2007 |
Morning |
|
Afternoon |
||
6th Dec 2007 |
Morning |
Association Rule, Frequent Pattern
(II), Classification
& Regression (I) |
Afternoon |
||
Night |
||
7th Dec 2007 |
Morning |
|
Afternoon |
||
Night |
||
8th Dec 2007 |
Morning |
|
Afternoon |
Timing:
Morning: 9am-12pm
Afternoon: 2pm-5pm
Night: 7pm-10pm
Texts:
[HMS01]: Principles of Data Mining, David Hand, Heikki Mannila and Padhraic Smyth, The MIT Press, 2001
[HK01]: Data Mining: Concepts and Techniques, Second Edition, J. Han and M. Kamber, Morgan Kaufmann, 2006
[Mit97]: Machine Learning, Tom M. Mitchell, McGrawHill, 1997.
Readings: (Tentative)
Introduction/A Quick Overview
Machine Learning and Statistic
Database Techniques I (Indexing)
· A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching." SIGMOD Conference 1984: 47-57
· Nick Roussopoulos, Steve Kelley, and F. Vincent. “Nearest Neighbor Queries”. Proc. of ACM-SIGMOD, pages 71--79, May 1995.
· R. Baeza-Yates, et al, Modern Information Retrieval (Acm Press Series), 1999, Chapter 8.1-8.3
Database Techniques II (Pre-Computation)
· J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichaxt, and M. Venkatrao. “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”. In DMKD, pages 29-53. Kluwer Academic Publishers, 1997
· Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conference 1996: 205-216
· H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. “Optimal Histograms with Quality Guarantees”. VLDB 1998
· HK06: Chapter 3-4
Association Rule, Frequent Pattern (I)
· http://www.cs.helsinki.fi/u/salmenki/dmcourse/chap12.pdf Chapter 2
·
Mohammed J. Zaki, Srinivasan Parthasarathy,
Mitsunori Ogihara, Wei Li, "New
Algorithms for Fast Discovery of Association Rules", 3rd
International Conference on Knowledge Discovery and Data Mining (KDD), pp
283-286, Newport, California, August, 1997. (Journal
Version in 2000)
Note: The first paper the introduce depth first
search for frequent pattern mining.
·
P.Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia,
M. Bawa, D. “Turbo-Charging Vertical
Mining of Large Database.” Proc. 2000 ACM-SIGMOD Int. Conf. on Management
of Data (SIGMOD'00), Dallas, TX, May 2000.
Note: Tested on a dataset of 1.2GB, one of the
largest that I have seen in data mining literatures. (Code Available Here)
· Gao Cong, Beng Chin Ooi, Kian-Lee Tan, Anthony K. H. Tung, "Go Green: Recycle and Reuse Frequent Patterns". International Conference on Data Engineering (ICDE'2004), Boston, 2004
Optional References:
· Zijian Zheng, Ron Kohavi, Llew Mason. Real World Performance of Association Rule Algorithms. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
· H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung, "ItCompress: An Iterative Semantic Compression Algorithm".International Conference on Data Engineering (ICDE'2004), Boston, 2004.
Association Rule, Frequent Pattern (II)
·
Mohammed J. Zaki, “Efficient
Enumeration of Frequent Sequences”, 7th International Conference on
Information and Knowledge Management, pp 68-75, Washington DC, November 1998.
Note: The first paper that introduce depth first search for sequential pattern
mining.
·
Kevin S. Beyer, Raghu Ramakrishnan: Bottom-Up Computation
of Sparse and Iceberg CUBEs. SIGMOD Conference 1999: 359-370.
Note: The first paper that use pointer-based algorithm for frequent pattern
mining.
· Mohammed J. Zaki, Ching-Jui Hsiao, CHARM: An Efficient Algorithm for Closed Itemset Mining, 2nd SIAM International Conference on Data Mining, Arlington, April 2002.
· Cuiping Li, Gao Cong, Anthony K.H. Tung, Shan Wang. “Incremental Maintenance of Quotient Cube for Median”. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04) 2004
· HK06: Chapter 5.5
Classification & Regression (I)
Classification & Regression (II)
Clustering (I)
Optional References:
Clustering (II)
· HK06: Chapter 7.9- 7.11
· Anthony K. H. Tung, Jean Hou, Jiawei Han, “Clustering in the Presence of Obstacles”. In Proc. of 17th International Conference on Data Engineering (ICDE’01, Heidelberg, Germany p359-367.
· Anthony K. H. Tung, Raymond T. Ng, Laks V. S. Lakshmanan,, Jiawei Han, “Constrained Clustering on Large Database” , Proc. 8th Intl. Conf. on Database Theory (ICDT’01), London, UK, Jan. 2001, p405-419.
· Wen Jin, Anthony K. H. Tung , Jiawei. Han, “Finding Top-n Local Outliers in Large Database”, in 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (SIGKDD’01)
Ranking: Skyline/Dominance Relationship Analysis
· Stephan Börzsönyi, Donald Kossmann, Konrad Stocker: "The Skyline Operator". ICDE 2001: 421-430
· Papadias, D., Tao, Y., Fu, G., Seeger, B. "An Optimal and Progressive Algorithm for Skyline Queries". Proceedings of the ACM Conference on the Management of Data (SIGMOD), pp. 443-454, San Diego, CA, June 9-12, 2003.
· Cuiping Li, Beng Chin Ooi, Anthony K. H. Tung, Shan Wang. "DADA: A Data Cube for Dominant Relationship Analysis", In Proceedings, ACM SIGMOD Int'l. Conference on Management of Data , Chicago, 2006 (SIGMOD'06)
· Chee-Yong Chan, H. V. Jagadish, Kian-Lee Tan, Anthony K. H. Tung, Zhenjie Zhang. "Finding k-Dominant Skylines in High Dimensional Space", In Proceedings, ACM SIGMOD Int'l. Conference on Management of Data , Chicago, 2006 (SIGMOD'06)
· Cuiping Li, Anthony K. H. Tung, Wen Jin, Martin Ester. "On Dominating Your Neighborhood Profitably". accepted and to appear in the 33rd Very Large Data Bases conference (VLDB'07), Vienna, 2007
Optional References:
·
Chee-Yong Chan, H. V.
Jagadish, Anthony K.H. Tung, Kian-Lee Tan, Zhenjie Zhang. "On High Dimensional Skylines
". In EDBT 2006
·
Wen Jin, Anthony K. H.
Tung, Martin Ester and Jiawei Han,''On
Efficient Processing of Subspace Skyline Queries on High Dimensional Data'', in Proc. 2007 Int'l.Conf.on Scientific and
Statistical Database Management (SSDBM'07), Canada, July. 2007.
Searching and Mining High Dimensional Data
· Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach", VLDB'06.
· C. Xia, H. Lu B. C. Ooi, and J. Hu "GORDER: An Efficient Method for KNN Join Processing" VLDB 2004
· Anthony K. H. Tung, Xin Xu, Beng Chin Ooi. "CURLER: Finding and Visualizing Nonlinear Correlated Clusters" In Proceedings, SIGMOD'05, Baltimore, Maryland 2005.
· Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". In Proceedings SIGMOD'05,Baltimore, Maryland 2005
Optional References:
· Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki, "CARPENTER: Finding Closed Patterns in Long Biological Datasets", In Proceedings KDD'03, Washington, DC, USA, August 24-27, 2003.
· Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". Iin SIGMOD'04, June 13-18, 2004, Maison de la Chimie, Paris, France.
· Feng Pang, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery". SSDBM 2004 Santorini Island Greece.
· Gao Cong, Kian-Lee Tan, Anthony K.H. Tung, Feng Pan. “Mining Frequent Closed Patterns in Microarray Data”. In IEEE International Conference on Data Mining, (ICDM). 2004
Searching and Mining Sequences
· Benjarath Phoophakdee, Mohammed J. Zaki: "Genome-scale disk-based suffix tree indexing". SIGMOD Conference 2007: 833-844
· Chen Li, Bin Wang, and Xiaochun Yang . "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams". Accept and to Appear in VLDB 2007.
· nT.L. Bailey and Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 2, pages 28--36. 1994
· Heikki Mannila, Christopher Meek: Global partial orders from sequential data. KDD 2000: 161-168
Optional References:
· U. Keich and P. Pevzner. Subtle motifs: defining the limits of motif finding algorithms. Bioinformatics, in press, 2002.
· B Ma, J Tromp, M Li. PatternHunter: faster and more sensitive homology search - Bioinformatics, 2002 - Oxford Univ Press
· Xin Xu, Ying Lu, Anthony K.H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". Accepted and to appear in ICDE 2006.
Searching and Mining Trees
· Philip Bille . A survey on tree edit distance and related problems. Theoretical Computer Science. Volume 337 , Issue 1-3 (June 2005)
· Rui Yang, Panos Kalnis, Anthony K. H. Tung: Similarity Evaluation on Tree-structured Data. SIGMOD 2005.
· Hisashi Kashima Teruo Koyanagi. Kernels for Semi-Structured Data. Proceedings of the Nineteenth International Conference on Machine Learning. 291-298. 2002.
· Mohammed J. Zaki. "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. Pg. 71-80. 2002
Optional References:
· JP Vert. "A tree kernel to analyze phylogenetic profiles" - Bioinformatics, 2002 - Oxford Univ Press
Searching and Mining Graphs
·
Hisashi Kashima, Koji
Tsuda, Akihiro
Inokuchi: Marginalized
Kernels Between Labeled Graphs. ICML 2003:
321-328
·
Bei Yu, Guoliang Li,
Karen Sollins, Anthony K. H. Tung.
Effective Keyword-based Selection of
Relational Databases. ACM SIGMOD 2007.
·
X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”,
Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining
(KDD'03), Washington, D.C., Aug. 2003..
·
X. Yan, P. S. Yu, and
J. Han, “Substructure Similarity Search in
Graph Databases”, in Proc. 2005 ACM-SIGMOD Int.
Conf. on Management of Data (SIGMOD'05), Baltimore, Maryland, June 2005
·
HK06: Chapter 9.1
Optional References:
· H Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 1997
· H Bunke, K Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 1998 - ece.concordia.ca