Pattern Spaces
Participants:
Guozhu Dong, Mengling Feng, Ngo Thanh Son, Terk Shuen Lee,
Jinyan Li, Guimei Liu, Limsoon Wong.
Background
There are many previous data mining works on frequent itemsets,
their closed patterns, and their generators. There are also a number
of studies on emerging patterns and their borders. But the mining
of odds ratio patterns, relative risk patterns, and patterns having
other statistical properties, have never been investigated extensively.
There is no obvious way to adapt the methods for mining frequent itemsets
or emerging patterns to obtain efficient methods for mining odds ratio
patterns, relative risk patterns, and patterns satisfying other statistical
properties. While frequent patterns are studied traditionally by the data
mining community, they are not often used medical data analysis nor in
traditional statistical analysis. In contrast, odds ratio, relative risk,
and other statistical properties are much more commonly encountered in
the analysis of medical data and in traditional statistical analysis.
Thus odds ratio patterns, relative risk patterns, and patterns satisfying
other statistical properties deserve our attention. In this project,
we would like to (a) study in depth the theoretical properties of these
patterns, (b) develop efficient algorithms for their mining, (c) develop
efficient algorithms for their incremental maintenance when the underlying
databases are updated, (d) investigate ways to build classifiers based on
them and develop techniques for visualizing and explaining decisions made
by such classifiers, and (e) their applications to biomedical data analysis.
Objectives
- Study the theoretical properties of these patterns.
In particular, we are interested in structural properties that
are useful for a more compact representation or for a greater
level comprehensibility. For example, whether the space is convex,
or is decomposable into subspaces that are convex, and whether
the decomposition is optimal. We are also interested in the changes
in such pattern spaces when the underlying databases are updated.
For example, linking changes in the pattern space to changes in the
underlying database may provide insight to cause-and-effect that is
not easy to grasp otherwise.
- Develop efficient algorithms for their mining. These patterns
satisfy more complex statistical properties (e.g., odds ratio)
then frequent patterns. Thus the complexity of naive algorithms
for mining them would be considerably higher on the average than
those for frequent patterns. We are interested to investigate if
there are efficient algorithms for mining these patterns, and if
these algorithms can be decomposed so that they share a common
core that can be easily reused or adapted for different pattern
spaces.
- Develop efficient algorithms for their incremental maintenance
when the underlying databases are updated. Databases are updated
from time to time. While it is possible to recomputed these patterns
from scratch using the updated database, it may be much more
efficient to compute the changes to these patterns based on changes
to the database. We would also like to investigate what type of
small changes in the underlying database can cause large changes
to the pattern space; such links may provide better insight to
cause-and-effect of events than other type of changes.
- Build classifiers based on them and develop techniques for
visualizing and explaining decisions. Each pattern can be thought
of as a classification rule. Thus the mining of these patterns
can be thought of as the extraction of a collection of
classification rules. We are interested to develop techniques
for constructing accurate classifiers based on these patterns/rules.
We are especially interested in the robustness of these classifiers
with respect to missing values or wrong values in training and
testing data.
Main Results
- Please see this summary poster for
the main technical results of this project.
Selected Publications
- Jinyan Li, Limsoon Wong.
Geography of Differences Between Two Classes of Data.
Proceedings 6th European Conference on Principles of
Data Mining and Knowledge Discovery,
pages 325--337, Helsinki, Finland, August 2002.
PS
-
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao, Limsoon Wong.
DeEPs: A New Instance-based Discovery and Classification System.
Machine Learning, 54(2):99--124, 2004.
PS
-
Jinyan Li, Limsoon Wong.
Structural Geography of the Space of Emerging Patterns.
Intelligent Data Analysis, 9(6):567--588, 2005.
PDF
-
Guozhu Dong, Chunyu Jiang, Jian Pei, Jinyan Li, Limsoon Wong.
Mining Succint Systems of Minimal Generators of Formal Concepts.
Proceedings of 10th International Conference on Database Systems
for Advanced Applications,
pages 175--187, Beijing, China, April 2005.
PDF
- Haiquan Li, Jinyan Li, Limsoon Wong, Mengling Feng, Yap-Peng Tan.
Relative Risk and Odds Ratio: A Data Mining Perspective.
Proceedings of 24th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems,
pages 368--377, Baltimore, Maryland, June 2005.
PDF,
GcGrowth V1.0,
GcGrowth V2.0
GcGrowth V2.1
-
Jinyan Li, Haiquan Li, Donny Soh, Limsoon Wong.
A Correspondence Between Maximal Complete Bipartite Subgraphs
and Closed Patterns.
Proceedings of 9th European Conference on Principles and Practice
of Knowledge Discovery in Databases,
pages 146--156, Porto, Portugal, October 2005.
PDF,
FP-MBC Software
- Guimei Liu, Jinyan Li, Limsoon Wong, Wynne Hsu.
Positive Borders or Negative Borders: How to Make
Generator-Based Lossless Representation Concise,
Proceedings of SIAM Conference on Data Mining,
pages 469--473, Bethesda, Maryland, April 2006.
PDF,
GrGrowth-PBd Software
- Jinyan Li, Haiquan Li, Limsoon Wong, Jian Pei, Guozhu Dong.
Minimum Description Length Principle: Generators are
Preferrable to Closed Patterns,
Proceedings of 21st National Conference on Artificial Intelligence,
pages 409--415, Boston, Mass, July 2006.
PDF
- Guimei Liu, Jinyan Li, Kelvin Sin, Limsoon Wong.
Distance-Based Subspace Clustering with Flexible Dimension Partitioning.
Proceedings of 23rd IEEE International Conference on Data Engineering,
pages 1250--1254, Istanbul, Turkey, April 2007.
PDF,
nClusters Software, V1.0
- Mengling Feng, Guozhu Dong, Jinyan Li, Yap-Peng Tan, Limsoon Wong.
Evolution and Maintenance of Frequent Pattern
Space when Transactions are Removed.
Proceedings of 11th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD),
pages 489--497, Nanjing, China, May 2007.
PDF,
PPT,
TRUM Software
- Jinyan Li, Guimei Liu, Limsoon Wong.
Mining Statistically Important Equivalence Classes and
Delta-Discriminative Emerging Patterns.
Proceedings of 13th International Conference on Knowledge Discovery
and Data Mining,
pages 430--439, San Jose, California, 12-15 August 2007.
PDF,
DPM Software
- Jinyan Li, Guimei Liu, Haiquan Li, Limsoon Wong.
Maximal Biclique Subgraphs and Closed Pattern Pairs of the
Adjacency Matrix: A 1-to-1 Correspondence and Mining Algorithms.
IEEE Transactions on Knowledge and Data Engineering,
19(2):1625--1637, December 2007.
PDF,
LCM-MBC Software
- Guimei Liu, Jinyan Li, Limsoon Wong.
A New Concise Representation of Frequent Itemsets
Using Generators and a Positive Border.
Knowledge and Information Systems, 17(1):35--56, October 2008.
PDF,
GrGrowth-PBd Software
- Jinyan Li, Kelvin Sim, Guimei Liu, Limsoon Wong.
Maximal Quasi-Bicliques with Balanced Noise Tolerance:
Concepts and Co-Clustering Applications.
Proceedings of SIAM Conference on Data Mining,
pages 72--83, Atlanta, Georgia, 24-26 April 2008.
PDF
- Guimei Liu, Limsoon Wong.
Effective Pruning Techniques for Mining Quasi-Cliques.
Proceedings of European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD),
part II, pages 33--49, Antwerp, Belgium, 15-19 September 2008.
PDF,
Quick Software
- Mengling Feng, Jinyan Li, Limsoon Wong, Yap-Peng Tan.
Negative Generator Border for Effective Pattern Maintenance.
Proceedings of 4th International Conference on Advanced Data Mining
and Applications,
pages 217--228, Chengdu, China, 8-10 October 2008.
(Best Paper Runner-Up at ADMA2008)
PDF,
PSM Software
- Mengling Feng, Jinyan Li, Guozhu Dong, Limsoon Wong.
Maintenance of Frequent Patterns: A Survey.
Post-Mining of Association Rules: Techniques for Effective
Knowledge Extraction,
edited by Yanchang Zhao, Chengqi Zhang, and Longbing Cao,
chapter XIV, pages 275--295,
IGI Global, May 2009.
PDF
- Guozhu Dong, Jinyan Li, Guimei Liu, Limsoon Wong.
Mining Conditional Contrast Patterns.
Post-Mining of Association Rules: Techniques for Effective
Knowledge Extraction,
edited by Yanchang Zhao, Chengqi Zhang, and Longbing Cao,
chapter XV, pages 296--311, IGI Global, May 2009.
PDF
- David Lo, Siau-Cheng Khoo, Limsoon Wong.
Non-Redundant Sequential Rules---Theory and Algorithm.
Information Systems, 34(4-5):438-453, June-July 2009.
PDF
- David Lo, Jinyan Li, Limsoon Wong, Siau-Cheng Khoo.
Mining Iterative Generators and Representative Rules for
Software Specification Discovery.
IEEE Transactions on Knowledge and Data Engineering,
23(2):282--296, February 2011.
PDF
- Mengling Feng, Guozhu Dong, Jinyan Li, Yap-Peng Tan, Limsoon Wong.
Pattern Space Maintenance for Data Updates.
Computational Intelligence, 26(3):282--317, August 2010.
PDF,
PSM Software
- Guimei Liu, Kelvin Sim, Jinyan Li, Limsoon Wong.
Efficient Mining of Distance-Based Subspace Clusters.
Statistical Analysis and Data Mining, 2(5):427--444, December 2009.
PDF,
nClusters Software, V1.1
- Thanh-Son Ngo, Mengling Feng, Guimei Liu, Limsoon Wong.
Efficiently Finding the Best Parameter for the Emerging
Pattern-based Classifier PCL.
Proceedings of 14th Pacific-Asia Conference on Knowledge Discovery and
Data Mining (PAKDD), Part I,
pages 121--133, Hyderabad, India, June 2010.
PDF,
ePCL Software, v2.0,
ePCL Software, v3.0,
PCL Experiments Environment
- Guimei Liu, Yue Wang, Limsoon Wong.
FastTagger: An Efficient Algorithm for Genome-wide Tag SNP Selection
Using Multi-marker Linkage Disequilibrium.
BMC Bioinformatics, 11:66, February 2010.
PDF
Dissertations
Selected Presentations
-
Limsoon Wong.
Convexity of Itemset Spaces.
Invited talk at Academia Sinica, Taipei, Taiwan, 28 May 2004.
PPT
- Limsoon Wong.
Theory, Practice, and an Application of Frequent Pattern
Space Maintenance.
Invited talk at 6th International Conference on Information Technology
for Education and Research (IT@EDU2010 2010),
Ho Chi Minh City, Vietnam, 18 August 2010.
PPT
Acknowledgements
This project is supported in part by
a A*STAR AGS scholarship (Feng: 4/04 - 3/08; Lee: 4/04 - 3/08), the
I2R-SOC Joint Lab on Knowledge Discovery
from Clinical Data (7/03 - 6/07),
NUS ARF Grant R-252-050-238-101/133 (11/05 - 11/08),
and SERC PSF Grant R-252-000-316-305 (8/07 - 12/10).
Last updated: 3 December 2013, Limsoon Wong.