Data Mining Algorithms for Pharmacogenomics

Participants: Pauline Chen, Coral Lai, Tze Yun Leong, Lin Li, Guimei Liu, Yue Wang, Limsoon Wong.

Background

The human genome harbors millions of common single nucleotide polymorphisms (SNPs) and other types of genetic variation. These variations play an important role in human diseases and in the body's responses to prescribed drugs. The study of genetic factors that contribute to variations in drug response, efficacy, and toxicity has come to be known as pharmacogenomics. In this project, we explore several pharmacogenomics-related applications of database and data mining technologies.

Objectives

We focus on SNPs and how they affect drug response, with ethnic diversity as an important aspect. We propose to integrate drug-enzyme interaction data, enzyme-SNP data, and various HapMap-type data into a web-based bioinformatics tool that allows users to search for variations that may be significant in determining drug response. We also aim to support searching for drug-enzyme relationships and to use text mining to supplement current databases, which are incomplete.
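The kind of integration described above can be pictured as a chain of joins from drug to enzyme to SNP to population frequency. The following is only a minimal sketch under stated assumptions: all identifiers (`drugA`, `enzyme1`, `snp1`, `pop1`), the toy frequencies, the `candidate_snps` function, and the `min_freq_gap` cutoff are hypothetical placeholders, not the project's data, schema, or tool.

```python
from collections import defaultdict

# Hypothetical toy records; every identifier here is an illustrative placeholder.
DRUG_ENZYME = [("drugA", "enzyme1"), ("drugA", "enzyme2"), ("drugB", "enzyme2")]
ENZYME_SNP = [("enzyme1", "snp1"), ("enzyme2", "snp2"), ("enzyme2", "snp3")]
# Allele frequency per (SNP, population), in the spirit of HapMap-type data.
SNP_POP_FREQ = {
    ("snp1", "pop1"): 0.05, ("snp1", "pop2"): 0.30,
    ("snp2", "pop1"): 0.20, ("snp2", "pop2"): 0.22,
    ("snp3", "pop1"): 0.01, ("snp3", "pop2"): 0.40,
}


def candidate_snps(drug, min_freq_gap=0.1):
    """List (enzyme, SNP, frequency gap) for SNPs in enzymes linked to `drug`
    whose allele frequency differs across populations by at least min_freq_gap."""
    enzymes = {e for d, e in DRUG_ENZYME if d == drug}
    snps_by_enzyme = defaultdict(list)
    for enzyme, snp in ENZYME_SNP:
        if enzyme in enzymes:
            snps_by_enzyme[enzyme].append(snp)
    hits = []
    for enzyme in sorted(snps_by_enzyme):
        for snp in snps_by_enzyme[enzyme]:
            freqs = [f for (s, _pop), f in SNP_POP_FREQ.items() if s == snp]
            gap = max(freqs) - min(freqs)
            if gap >= min_freq_gap:
                hits.append((enzyme, snp, round(gap, 2)))
    return hits


print(candidate_snps("drugA"))
```

A real tool would replace the in-memory lists with database tables and would likely use a statistical measure of population differentiation rather than a raw frequency gap.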
Genotyping all SNPs is very expensive. Fortunately, adjacent SNPs are often not independent. It is thus desirable to select a subset of SNPs that is sufficient to infer all the other SNPs. These selected SNPs are called tag SNPs. We propose algorithms to select tag SNPs based on multi-marker correlations. Compared with existing tag SNP selection algorithms, we aim to be many times faster, consume much less memory, and select fewer tag SNPs. At the same time, we develop techniques that use the tagging rules discovered during tag SNP selection to impute untyped SNPs at significantly higher accuracy and sensitivity than existing methods.
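To make the idea of tag SNPs concrete, here is a minimal sketch of the simplest variant: greedy selection using pairwise r² only. It is not the multi-marker approach proposed in this project and not the FastTagger algorithm; the function names, the 0/1 haplotype encoding, and the 0.8 threshold are assumptions chosen for illustration.

```python
def r_squared(a, b):
    """Pairwise LD measure r^2 between two biallelic SNPs, each given as a
    list of 0/1 alleles over the same haplotypes."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treat as uncorrelated
    d = pab - pa * pb
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs so that every SNP is either a tag or has r^2 >= threshold
    with some tag. `haplotypes` is a list of rows of 0/1 alleles."""
    m = len(haplotypes[0])
    cols = [[row[j] for row in haplotypes] for j in range(m)]
    # covers[i] = SNPs that SNP i could stand in for (always includes i itself).
    covers = {
        i: {j for j in range(m)
            if i == j or r_squared(cols[i], cols[j]) >= threshold}
        for i in range(m)
    }
    untagged = set(range(m))
    tags = []
    while untagged:
        # Take the SNP covering the most untagged SNPs; ties go to the lowest index.
        best = max(range(m), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return tags


haps = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
print(greedy_tag_snps(haps))
```

In this toy input the first three SNPs are perfectly correlated with each other and the fourth is independent, so two tags suffice. Multi-marker rules can reduce the tag set further because a SNP may be predictable from a combination of tags even when no single tag predicts it well.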
The identification of disease-causing gene locations is an important topic that has a significant impact on patient management decisions. The process of finding disease gene locations through comparisons of marker allele frequencies between disease chromosomes and control chromosomes is known as linkage disequilibrium mapping. We propose algorithms to infer disease gene location. We aim to consistently produce good predictive accuracy under different conditions, including extreme conditions where disease samples carrying the mutation of interest are very rare and the data are very noisy. We also aim to be fast and model-free.
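The comparison of marker allele frequencies between disease and control chromosomes can be illustrated with a single-marker test. The sketch below ranks markers by a Pearson chi-square statistic on a 2x2 allele count table. It is a deliberately simple baseline, not the haplotype-pattern methods developed in this project; the function names and the 0/1 haplotype encoding are assumptions.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no association signal
    return n * (a * d - b * c) ** 2 / denom


def rank_markers(case_haps, control_haps):
    """Score each marker by how strongly its allele counts differ between
    disease and control chromosomes; return (marker index, statistic) pairs
    sorted from strongest to weakest signal."""
    m = len(case_haps[0])
    scores = []
    for j in range(m):
        case_alt = sum(h[j] for h in case_haps)
        ctrl_alt = sum(h[j] for h in control_haps)
        stat = chi_square_2x2(
            case_alt, len(case_haps) - case_alt,
            ctrl_alt, len(control_haps) - ctrl_alt,
        )
        scores.append((j, stat))
    return sorted(scores, key=lambda t: -t[1])


cases = [[1, 0], [1, 1], [1, 0], [1, 1]]
controls = [[0, 0], [0, 1], [0, 1], [0, 0]]
print(rank_markers(cases, controls))
```

The marker with the largest statistic is the naive estimate of the region closest to the disease gene. Single-marker tests like this lose power when the causal mutation is rare or the data are noisy, which is the setting the objectives above single out.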

Selected Publications

Li Lin, Limsoon Wong, Tze-Yun Leong, Poh San Lai. LinkageTracker: A Discriminative Pattern Tracking Approach to Linkage Disequilibrium Mapping. Proceedings of 10th International Conference on Database Systems for Advanced Applications, pages 30--42, Beijing, China, April 2005. PDF
Li Lin, Limsoon Wong, Tze-Yun Leong, Poh San Lai. ECTracker: An Efficient Algorithm for Haplotype Analysis and Classification. Proceedings of 12th Triennial Medinfo 2007 Congress, pages 1270--1274, Brisbane, Australia, 21-24 August 2007.
Guimei Liu, Yue Wang, Limsoon Wong. FastTagger: An Efficient Algorithm for Genome-Wide Tag SNP Selection. BMC Bioinformatics, 11:66, February 2010. PDF, FastTagger V1.0
Li Lin, Limsoon Wong, Tze-Yun Leong, Poh San Lai. Efficient Mining of Haplotype Patterns for Linkage Disequilibrium Mapping. Journal of Bioinformatics and Computational Biology, 8(Suppl. 1):127--146, December 2010. PDF
Zhengkui Wang, Yue Wang, Kian-Lee Tan, Limsoon Wong, Divyakant Agrawal. CEO: A Cloud Epistasis cOmputing model in GWAS. Proceedings of 4th IEEE International Conference on Bioinformatics & Biomedicine, pages 85--90, Hong Kong, December 2010. PDF
Zhengkui Wang, Yue Wang, Kian-Lee Tan, Limsoon Wong, Divyakant Agrawal. eCEO: An efficient Cloud Epistasis cOmputing model in genome-wide association study. Bioinformatics, 27(8):1045--1051, April 2011. PDF, Supplementary Data, eCEO V1.0.
Yue Wang, Guimei Liu, Mengling Feng, Limsoon Wong. An Empirical Comparison of Several Recent Epistatic Interaction Detection Methods. Bioinformatics, 27(21):2936--2943, November 2011. Corrigendum. Bioinformatics, 28(1):147--148, January 2012. PDF
Yue Wang, Wilson Goh, Limsoon Wong, Giovanni Montana. Random forests on Hadoop for genome-wide studies of multivariate neuroimaging phenotypes. BMC Bioinformatics, 14(Suppl 16):S6, October 2013. PDF

Dissertations

Li Lin, Efficient Mining of Haplotype Patterns for Disease Prediction. PhD thesis, School of Computing, National University of Singapore, 2008.
Wang Yue, Efficient Computational Techniques for Tag SNP Selection, Epistasis Analysis, and Genome-Wide Association Study. PhD thesis, NUS Graduate School of Integrative Sciences and Engineering, National University of Singapore, 2012.
Jieqi Pauline Chen, SNP Data Integration and Analysis for Drug-Response Biomarker Discovery. Honours Year Project Report, School of Computing, National University of Singapore, 2009.

Selected Presentations

Limsoon Wong. Tag SNP Selection and Disease Gene Location Inference. Invited talk at Hong Kong University, Hong Kong, 12 May 2009. PPT
Limsoon Wong. A Few Simple Ideas for Efficient Tag SNP Selection. Invited talk at Peking University, Beijing, China, 22 July 2009.
Limsoon Wong. Epistasis Testing on the Cloud. Invited talk at 22nd FAOBMB Conference, Biopolis, Singapore, 5 October 2011. PPT

Acknowledgements

This project is supported in part by NUS ARF grant R-252-050-238-101/133 (Wong: 11/05 - 11/08), SERC PSF grant 072-101-0016 (Liu, Wong: 8/07 - 12/10), and an NGS scholarship (Wang).

Last updated: 13/8/13, Limsoon Wong.