Tools for Design of Microarray and
Analysis of Gene Expression Profiles
for Disease Diagnosis and Prognosis
Participants: Jinyan Li, Kui Lin, Guimei Liu, Huiqing Liu, V. S. Sundararajan,
Limsoon Wong, Allen Yeoh
Background
The development of microarray technology has made possible the
simultaneous monitoring of the expression of thousands of genes.
This development offers great opportunities in advancing the
diagnosis of diseases, the treatment of diseases, and the understanding
of gene functions. This project aims to:
(1) develop technologies for the design of microarrays;
(2) develop tools for the analysis of gene expression profiles,
especially for optimization of disease treatment; and
(3) apply these tools for optimization of disease treatment,
with childhood leukemias as the initial area.
Achievements
-
We developed PCL (Prediction by Collective Likelihood of Emerging Patterns).
This is a classifier that is accurate and easy to understand.
The classifier relies on an efficient mining of emerging patterns.
It uses an ensemble of emerging patterns matching the test instance for
classification, and the emerging patterns can be used to explain
the prediction. It performs superbly on gene expression profile
classification problems, in conjunction with suitable gene selection
strategies.
-
We also developed CS4 (Cascading and Sharing for ensemble of decision trees).
This is a classifier that is accurate, suitable for high-dimensional data,
and easy to understand. This classifier generates multiple classification
trees and then combines their power to make a prediction. The trees can be
used to explain the prediction. It performs very well on a broad range
of problems: it is generally superior to other ensemble methods such
as Boosting and Bagging, and is often comparable to SVM in accuracy.
-
We developed ERCOF (Entropy-Based Rank Sum Test and Correlation Filtering).
This is a robust feature selection strategy suitable for gene expression
profile classification. In an exhaustive test involving over 1000 experiments,
ERCOF was shown to enhance the performance of a wide range of
classifiers better than other feature selection methods.
-
We developed the idea of "extreme sample selection". Instead of
using all available training samples for classifier training, we proposed
to use the so-called "extreme samples". For example, to learn a classifier
to predict patient survival, we would use those who survived the longest
and those who died immediately as training samples, and ignore all the
rest. This strategy often enhanced prediction performance as it
brought out the sharp contrast between the
two classes of samples.
-
We developed nCluster, a distance-based subspace clustering model and mining
method, to find groups of objects (e.g., genes, patients) that have similar
values on subsets of dimensions. Traditional similarity or distance
measurements usually become meaningless when the dimensions of the datasets
increase, which has detrimental effects on clustering performance.
Instead of using a grid-based approach to partition the data
space into non-overlapping rectangular cells as in the
density-based subspace clustering algorithms, the nCluster model
uses a more flexible method to partition the dimensions to
preserve meaningful and significant clusters.
- We applied these techniques to a variety of gene expression profile
classification problems. Our most successful example was the prediction
of subtypes of childhood acute lymphoblastic leukemias, and the optimization
of treatment of this disease based on the predicted subtype. This work
was estimated to raise survival rates of childhood acute lymphoblastic
leukemia patients in ASEAN countries from about 20% to 80%,
with reduced side effects and relapse, and to save about US$50 million
annually in medical costs.
- We also helped to design and develop the world's first fission yeast
whole-genome profiling microarray.
- We won multiple awards. In particular,
- The 2003 Asian Innovation Award. This award is
Asia's premier honour for individuals who come up with new ideas,
methods or technologies to improve the quality of life.
- The 2006 Singapore Youth Award Medal of Commendation.
- PCL and CS4 have been commercialized through
KOOPrime.
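The "extreme sample selection" idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the project's implementation: the `Patient` record, its field names, the two cutoffs, and the rule that early-censored patients are dropped are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Patient:
    """Hypothetical patient record; field names are illustrative only."""
    patient_id: str
    survival_months: float  # length of follow-up
    died: bool              # True if death was observed during follow-up


def select_extreme_samples(
    patients: List[Patient],
    short_cutoff: float,
    long_cutoff: float,
) -> Tuple[List[Patient], List[Patient]]:
    """Keep only the two extremes of the cohort and ignore everyone else.

    Assumed rule: a short-term survivor died at or before short_cutoff;
    a long-term survivor was followed for at least long_cutoff.
    """
    if short_cutoff >= long_cutoff:
        raise ValueError("short_cutoff must be smaller than long_cutoff")
    short_term = [
        p for p in patients if p.died and p.survival_months <= short_cutoff
    ]
    long_term = [p for p in patients if p.survival_months >= long_cutoff]
    return short_term, long_term


# Toy cohort (made-up values, for illustration only).
cohort = [
    Patient("p1", 3.0, True),
    Patient("p2", 70.0, False),
    Patient("p3", 20.0, True),   # intermediate outcome: ignored
    Patient("p4", 8.0, False),   # left the study early: outcome unknown, ignored
    Patient("p5", 65.0, True),
    Patient("p6", 5.0, True),
]

short_term, long_term = select_extreme_samples(
    cohort, short_cutoff=12.0, long_cutoff=60.0
)
print([p.patient_id for p in short_term])  # ['p1', 'p6']
print([p.patient_id for p in long_term])   # ['p2', 'p5']
```

A classifier would then be trained on the two returned groups only. The cutoffs trade training-set size against how sharply the two classes differ, so in practice they would need to be chosen per dataset.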
Selected Publications
-
Jinyan Li, Limsoon Wong.
Emerging Patterns and Gene Expression Data.
Proceedings of 12th Workshop on Genome Informatics,
Tokyo, Japan, December 2001, pages 3--13.
PS
-
Jinyan Li, Limsoon Wong.
Identifying good diagnostic genes or genes groups from gene
expression data by using the concept of emerging patterns.
Bioinformatics, 18:725--734, 2002.
Corrigendum. Bioinformatics, 18:1407--1408, 2002.
PS
-
Eng-Juh Yeoh, Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams,
Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi,
Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana,
Dawn Wilkins, Xiaodong Zhou, Jinyan Li, Huiqing Liu, Ching-Hon Pui,
William E. Evans, Clayton Naeve, Limsoon Wong, James R. Downing.
Classification, subtype discovery, and prediction of outcome in
pediatric acute lymphoblastic leukemia by gene expression
profiling.
Cancer Cell, 1:133--143, March 2002.
PDF
-
Lance D. Miller, Phillip M. Long, Limsoon Wong, Sayan Mukherjee,
Lisa M. McShane, Edison T. Liu.
Optimal gene expression analysis by microarrays.
Cancer Cell, 2:353--361, November 2002. (Reviewed invited paper)
PDF
-
Jinyan Li, Limsoon Wong.
Geography of Differences Between Two Classes of Data.
Proceedings 6th European Conference on
Principles of Data Mining and Knowledge Discovery,
pages 325--337, Helsinki, Finland, August 2002.
PS
-
Jinyan Li, Limsoon Wong.
Solving the Fragmentation Problem of Decision Trees
by Discovering Boundary Emerging Patterns.
Proceedings of IEEE International Conference on Data Mining,
pages 653--656, Maebashi City, Japan, December 2002.
PS
-
Jinyan Li, Huiqing Liu, Limsoon Wong.
A comparative study on feature selection and classification methods
using a large set of gene expression profiles.
Proceedings of 13th International Conference on Genome Informatics,
pages 51--60, Tokyo, Japan, December 2002.
PS
-
Jinyan Li, Huiqing Liu, James R. Downing, Allen Eng-Juh Yeoh, Limsoon Wong.
Simple Rules Underlying Gene Expression Profiles of More than Six Subtypes of Acute Lymphoblastic Leukemia (ALL) Patients.
Bioinformatics, 19:71--78, 2003.
PS
-
Jinyan Li, Huiqing Liu, See-Kiong Ng, Limsoon Wong.
Discovery of Significant Rules for Classifying Cancer Diagnosis Data.
Bioinformatics, 19(suppl. 2):ii93--ii102, September 2003.
-
Jinyan Li, Limsoon Wong.
Using Rules to Analyse Bio-medical Data:
A Comparison between C4.5 and PCL.
Proceedings of 4th International Conference on
Web-Age Information Management,
pages 254--265, Chengdu, PRC, August 2003.
PS
-
Jinyan Li, Huiqing Liu, Limsoon Wong.
Mean-entropy discretized features are effective for
classifying high-dimensional biomedical data.
Proceedings of 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics,
pages 17--24, Washington, DC, August 2003.
-
See-Kiong Ng, Soon-Heng Tan, V. S. Sundararajan.
On Combining Multiple Microarray Studies for Improved Functional
Classification by Whole-Dataset Feature Selection.
Genome Informatics, 14:44--53, 2003.
PDF
-
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao, Limsoon Wong.
DeEPs: A New Instance-based Discovery and Classification System.
Machine Learning, 54(2):99--124, 2004.
PS
-
Jinyan Li, Limsoon Wong.
Techniques for Analysis of Gene Expression.
The Practical Bioinformatician,
chapter 14, pages 319--346, World Scientific, May 2004.
-
Kui Lin, Jianhua Liu, Lance Miller, Limsoon Wong.
Genome-Wide cDNA Oligo Probe Design and its Applications
in Schizosaccharomyces Pombe.
The Practical Bioinformatician,
chapter 15, pages 347--358, World Scientific, May 2004.
-
Jinyan Li, Huiqing Liu, Limsoon Wong.
Use of Built-in Features in the Interpretation of High-Dimensional
Cancer Diagnosis Data.
Proceedings of 2nd Asia Pacific Bioinformatics Conference,
pages 67--74, Dunedin, New Zealand, January 18-22, 2004.
PS
-
Huiqing Liu, Jinyan Li, Limsoon Wong.
Selection of Patient Samples and Genes for Outcome Prediction.
IEEE Bioinformatics Proceedings (CSB2004),
pages 382--392, Stanford, CA, August 2004.
PS
-
Guozhu Dong, Jinyan Li, Limsoon Wong.
The Use of Emerging Patterns in the Analysis of Gene Expression
Profiles for the Diagnosis and Understanding of Diseases.
New Generation of Data Mining Applications,
chapter 14, pages 331--354. John Wiley, April 2005.
PDF
-
Xu Peng, R. Krishna Murthy Karuturi, Lance D. Miller, Kui Lin,
Yonghui Jia, Pinar Konda, Long Wang, Limsoon Wong, Edison T. Liu,
Mohan K. Balasubramanian, Jianhua Liu.
Identification of Cell Cycle-regulated Genes in Fission Yeast.
Molecular Biology of the Cell, 16(3):1026--1042, March 2005.
PDF
-
Huiqing Liu, Jinyan Li, Limsoon Wong.
Use of Extreme Patient Samples for Outcome Prediction
from Gene Expression Data.
Bioinformatics, 21(16):3377--3384, 2005.
PDF
-
Jinyan Li, Limsoon Wong.
Structural Geography of the Space of Emerging Patterns.
Intelligent Data Analysis, 9(6):567--588, 2005.
PDF
- Guimei Liu, Jinyan Li, Kelvin Sim, Limsoon Wong.
Distance-Based Subspace Clustering with Flexible Dimension Partitioning.
Proceedings of 23rd IEEE International Conference on Data Engineering,
pages 1250--1254, Istanbul, Turkey, April 2007.
PDF,
nClusters Software, V1.0
- Guimei Liu, Kelvin Sim, Jinyan Li, Limsoon Wong.
Efficient Mining of Distance-Based Subspace Clusters.
Statistical Analysis and Data Mining, 2(5):427--444, December 2009.
nClusters Software, V1.1
Dissertations
Selected Presentations
-
A. Yeoh, K. Williams, D. Patel, S. Shurtleff, F. Behm, S. Raimondi,
M. Relling, C. Cheng, D. Wilkins, L. Wong, W. Evans, C.-H. Pui,
C. Naeve, J.R. Downing.
Expression Profiling of Pediatric Acute Lymphoblastic Leukemia (ALL)
Blasts at Diagnosis Accurately Predicts Both the Risk of
Relapse and of Developing Therapy-Induced Acute Myeloid
Leukemia (AML).
Blood, 98(11):1816 Part 1, November 2001.
Plenary talk at American Society of Hematology 43rd Annual Meeting,
Orlando, Florida, December 2001.
-
Limsoon Wong.
Some Questions on Assessing the Quality of Gene Expression Class
Prediction Results.
Invited talk at the Post-Genome Knowledge Discovery Program,
Workshop on Sequence and Gene Expression Analysis,
Institute for Mathematical Sciences, Singapore, January 2002.
-
Jinyan Li, Limsoon Wong.
Identifying Good Diagnostic Gene Groups from Gene Expression Profiles
Using the Concept of Emerging Patterns.
Invited talk at Singapore MicroArray Meeting 2002,
National Cancer Centre, Singapore, January 2002.
-
Jinyan Li, Huiqing Liu, Limsoon Wong.
Emerging Patterns: A New Approach to Analysing Gene Expression Data.
Talk and poster at HUGO 7th International Human Genome Meeting,
Shanghai, April 2002.
-
Limsoon Wong.
Identifying Good Diagnostic Gene Groups from Gene Expression Profiles
Using the Concept of Emerging Patterns.
Invited talk at University of Queensland, Brisbane, 10 February 2003.
PPT
-
Limsoon Wong.
Data Mining of Gene Expression Profiles for the Diagnosis
and Understanding of Diseases.
Keynote address at 7th International Database Engineering
and Application Symposium, Hong Kong, July 2003.
PPT
-
Limsoon Wong.
Gene Expression Profile for the Diagnosis and Understanding of Diseases.
Invited talk at DuNmAn Life-Sciences & Technology Day,
Dunman Secondary School, Tampines, 2 August 2003.
-
Limsoon Wong.
Diagnosis of Childhood Acute Lymphoblastic Leukaemia
and Optimization of Risk-Benefit Ratio of Therapy.
Invited talk at 1st Annual Symposium of Association of
Asian Societies for Bioinformatics,
Yokohama, Japan, December 2003.
PS,
PPT
-
Limsoon Wong.
Diagnosis of Childhood Acute Lymphoblastic Leukaemia and
Optimization of Risk-Benefit Ratio of Therapy.
Invited talk at NTU School of Computer Engineering, 24 November 2003.
-
Limsoon Wong.
Diagnosis of Childhood Acute Lymphoblastic Leukaemia
and Optimization of Risk-Benefit Ratio of Therapy.
Invited talk at National Cheng Kung University,
Tainan, Taiwan, 21 May 2004.
-
Limsoon Wong.
Diagnosis of Childhood Acute Lymphoblastic Leukaemia
and Optimization of Risk-Benefit Ratio of Therapy.
Invited talk at Edinburgh University, Edinburgh, 1 October 2004.
-
Limsoon Wong.
Knowledge Discovery in Biomedicine.
Invited talk at National Healthcare Group Annual Scientific Congress,
Raffles Convention Centre, Singapore, October 2004.
PPT
-
Limsoon Wong.
Selection of Patient Samples and Genes for Disease Prognosis.
Invited talk at "Informatics Inspired Biology" Symposium,
BioPolis, Singapore, 16 January 2005.
PPT
-
Limsoon Wong.
The Bright and Dark Side of Data Mining Research.
Invited panel discussion at PAKDD 2006,
Hilton Hotel, Singapore, 11 April 2006.
PPT
Acknowledgements
This project is supported in part by
NSTB grant LS/99/001/B,
the I2R-SOC Joint Lab on Knowledge Discovery from
Clinical Data,
NUS ARF grant R-252-050-238-101/133, and
SERC PSF grant 072 101 0016.
Last updated: 16/9/12, Limsoon Wong.