DNA Feature Recognition
Participants:
Vladimir Bajic, Rajesh Chowdhary, Chuan Hock Koh,
Huiqing Liu, Limsoon Wong, Roland Yap, Fanfan Zeng
Background
Correct prediction of transcription start sites,
translation initiation sites, gene splice sites, poly-A sites, and
other functional sites from DNA sequences is an important issue
in genomic research. In this project, we investigate these prediction
problems using the paradigm of ``feature generation, feature selection,
and feature integration''. There are two reasons for our interest
in such a paradigm.
The first reason is that standard toolboxes
can be identified and used for each of the three components. For example,
any statistical significance test can be used for feature selection.
Similarly, any machine learning method can be used for feature integration.
The main challenge is in developing a ``standard'' toolbox for feature
generation suitable for DNA functional sites.
The second reason is that features that are critical to the recognition
of specific DNA functional sites are explicitly generated and selected
in this paradigm. This explicitness is helpful in understanding the
underlying biological mechanism of that DNA functional site.
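The three components can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the k-mer frequency features, the t-like ranking statistic, the nearest-centroid classifier, and the toy sequences are generic stand-ins, not the features, tests, or learners actually used in this project.

```python
from itertools import product
from statistics import mean, pstdev

def generate_features(seq, k=2):
    """Feature generation: relative frequency of every 1..k-mer in the window."""
    feats = {}
    for n in range(1, k + 1):
        total = max(len(seq) - n + 1, 1)
        for kmer in map("".join, product("ACGT", repeat=n)):
            count = sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == kmer)
            feats[kmer] = count / total
    return feats

def select_features(pos, neg, top=3):
    """Feature selection: rank features by a simple t-like statistic (assumed)."""
    def score(name):
        p = [f[name] for f in pos]
        n = [f[name] for f in neg]
        return abs(mean(p) - mean(n)) / (pstdev(p) + pstdev(n) + 1e-9)
    return sorted(pos[0], key=score, reverse=True)[:top]

def train_centroids(pos, neg, selected):
    """Feature integration: one centroid per class over the selected features."""
    cp = [mean(f[s] for f in pos) for s in selected]
    cn = [mean(f[s] for f in neg) for s in selected]
    return cp, cn

def predict(seq, selected, centroids, k=2):
    """Label a window as a site (True) if it is closer to the positive centroid."""
    feats = generate_features(seq, k)
    x = [feats[s] for s in selected]
    cp, cn = centroids
    dist_pos = sum((a - b) ** 2 for a, b in zip(x, cp))
    dist_neg = sum((a - b) ** 2 for a, b in zip(x, cn))
    return dist_pos < dist_neg

# Toy data (hypothetical): GC-rich windows stand in for true sites.
pos_seqs = ["GCGCGGCC", "CCGGCGCG", "GGCCGCGC"]
neg_seqs = ["ATATTAAT", "TTAATATA", "AATTATAT"]
pos = [generate_features(s) for s in pos_seqs]
neg = [generate_features(s) for s in neg_seqs]

selected = select_features(pos, neg)
centroids = train_centroids(pos, neg, selected)
print(selected)
print(predict("GCGGCCGC", selected, centroids))  # True
print(predict("TATAATTA", selected, centroids))  # False
```

Because the selected feature names are explicit, the list returned by `select_features` can be inspected directly, which is the kind of explicitness the second reason above refers to.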
Achievements
-
We investigated some of the key components of the ``feature generation,
feature selection, and feature integration'' approach to DNA feature
recognition. We identified a set of standard
features that are suitable for accurate prediction of functional sites
in DNA sequences and for understanding their underlying biological
mechanism.
-
We developed DNAFSMiner, a prediction system based on
the ``feature generation, feature selection, and feature integration''
paradigm above, that exceeds 90% sensitivity and precision for prediction of
mammalian translation initiation sites and human poly-A sites.
-
We extended the ``feature generation, feature selection, and
feature integration'' paradigm with a ``cascade'' step.
Functional sites that do not have anchor motifs cannot be easily modeled
using the previous paradigm but can be easily modeled using
this extended paradigm.
We developed Sirius PSB, a generic system for developing classifiers
based on the ``feature generation, feature selection, feature integration,
and cascading classifier'' paradigm. This powerful software allows a
user to rapidly develop accurate classifiers for various types of
functional sites in both DNA sequences and protein sequences.
-
As a demonstration of Sirius PSB, we used it to construct a recognizer
for poly-A sites in Arabidopsis genomic sequences and achieved
the highest reported equal error rates to date (95.5% on coding
sequences, 85.7% on 5'UTR sequences, and 74.6% on intronic sequences).
As a second demonstration of Sirius PSB, we also used it to predict
(and subsequently experimentally verified) many novel septal pore proteins.
-
The research also got us interested in constructing explicit
models for function-specific promoters, and in using such models to
recognize co-regulated genes of specific functions. We approached
this problem by first identifying individual features (transcription
factor binding sites, TFBS)
in function-specific promoters, and then constructing a Bayesian
network to model the relationship of these features. We chose
histone genes as our test case.
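The ``cascade'' step mentioned in the achievements above is not spelled out on this page. One plausible reading, assumed here, is that a first-stage classifier scores every position near a candidate site and a second stage classifies the resulting score profile, so that no single anchor motif is required. The sketch below follows that reading only; the A/T-fraction scorer, the window sizes, the threshold, and the example sequence are hypothetical and are not taken from Sirius PSB.

```python
def stage_one_score(seq, i, width=3):
    """Hypothetical first stage: A/T fraction in a small window around position i."""
    lo, hi = max(0, i - width), min(len(seq), i + width + 1)
    window = seq[lo:hi]
    return sum(base in "AT" for base in window) / len(window)

def score_profile(seq, center, flank=2):
    """Cascade input: first-stage scores at the positions around the candidate."""
    return [
        stage_one_score(seq, j)
        for j in range(center - flank, center + flank + 1)
        if 0 <= j < len(seq)
    ]

def stage_two_predict(profile, threshold=0.6):
    """Hypothetical second stage: a fixed rule on the mean of the score profile.
    A real system would train a classifier on such profiles instead."""
    return sum(profile) / len(profile) >= threshold

# Made-up sequence with an A/T-rich stretch in the middle.
seq = "GCGCGCAATAAATTTTGCGC"
print(stage_two_predict(score_profile(seq, 10)))  # True
print(stage_two_predict(score_profile(seq, 2)))   # False
```

The second stage sees only the neighbourhood of first-stage scores, not the raw sequence, which is why a fixed anchor position is not needed in this sketch.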
Selected Publications
DNAFSMiner
-
Fanfan Zeng, Roland Yap, Limsoon Wong.
Using Feature Generation and Feature Selection for Accurate
Prediction of Translation Initiation Sites.
Proceedings of 13th International Conference on Genome Informatics,
pages 192--200, Tokyo, Japan, December 2002.
PPT
-
Huiqing Liu, Limsoon Wong.
Data Mining Tools for Biological Sequences.
Journal of Bioinformatics & Computational Biology,
1(1):139--168, April 2003.
PDF
-
Huiqing Liu, Hao Han, Jinyan Li, Limsoon Wong.
An in silico method for prediction of polyadenylation signals in
human sequences.
Proceedings of 14th International Conference on Genome Informatics,
pages 84--93, Yokohama, December 2003.
PDF /
Server
-
Jinyan Li, Huiqing Liu, Limsoon Wong, Roland Yap.
Techniques for Recognition of Translation Initiation Sites.
The Practical Bioinformatician, chapter 4,
pages 71--90, World Scientific, May 2004.
PS
-
Huiqing Liu, Hao Han, Jinyan Li, Limsoon Wong.
Using Amino Acid Patterns to Accurately Predict Translation
Initiation Sites.
In silico Biology, 4(3):255--269, 2004.
PDF /
Server
-
Huiqing Liu, Hao Han, Jinyan Li, Limsoon Wong.
DNAFSMiner: A Web-Based Software Toolbox to Recognize
Two Types of Functional Sites in DNA Sequences.
Bioinformatics, 21:671--673, March 2005.
PDF /
Server
Sirius PSB
-
Chuan Hock Koh, Limsoon Wong.
Recognition of Polyadenylation Sites from Arabidopsis Genomic Sequences.
Proceedings of 18th International Conference on Genome Informatics (GIW),
pages 73--82, Singapore, December 2007.
PDF /
Datasets & Source codes
- Chuan Hock Koh, Sharene Lin, Gregory Jedd, Limsoon Wong.
Sirius PSB: A Generic System for Analysis of Biological Sequences.
Journal of Bioinformatics and Computational Biology,
7(6):973--990, December 2009.
PDF /
Sirius PSB v2.1 Software /
Sirius PSB v2.33 Software /
Sirius PSB
Server
- Julian Lai, Chuan Hock Koh, Monika Tjota, Laurent Pieuchot,
Vignesh Raman, Karthik Balakrishna Chandrababu, Daiwen Yang,
Limsoon Wong, Gregory Jedd.
Intrinsically disordered proteins aggregate at fungal cell-to-cell
channels and regulate intercellular connectivity.
Proceedings of the National Academy of Sciences of the
United States of America,
109(39):15781--15786, September 2012.
PDF
Dragon Promoter Mapper
-
Rajesh Chowdhary, Limsoon Wong, Vladimir Bajic.
Finding Functional Promoter Motifs by Computational Methods:
A Word of Caution.
International Journal of Bioinformatics Research and Applications,
2(3):282--288, 2006.
-
Rajesh Chowdhary, Sin Lam Tan, R. Ayesha Ali, Brent Boerlage,
Limsoon Wong, Vladimir B. Bajic.
Dragon Promoter Mapper (DPM): A Bayesian Framework for
Modeling Promoter Structures.
Bioinformatics, 22:2310--2312, 2006.
- Rajesh Chowdhary, Vladimir B. Bajic, Difeng Dong, Limsoon Wong,
Jun S. Liu.
Genome-wide analysis of regions similar to promoters of histone genes.
BMC Systems Biology, 4(Suppl 1):S4, May 2010.
Dissertations
- Huiqing Liu.
Effective Use of Data Mining Technologies on Biological
and Clinical Data.
PhD thesis, Dept of Computer Science,
National University of Singapore, Singapore, 2004.
- Rajesh Chowdhary.
A Bayesian System for Modeling Promoter Structure: A
Case Study of Histone Promoters.
PhD thesis,
Dept of Computer Science,
National University of Singapore,
Singapore, 2007.
- Koh Chuan Hock.
A Generic System for Genomic Feature Recognition.
Honours Year Project Report,
Dept of Computer Science,
National University of Singapore,
Singapore, 2008.
Selected Presentations
-
Limsoon Wong.
Using Feature Generation and Feature Selection for Accurate Prediction
of Translation Initiation Sites.
Invited talk at 1st KAIST-Singapore Joint Workshop on
Bioinformatics & Natural Language Processing.
KAIST, Daejon, South Korea. 24 February 2003.
-
Limsoon Wong.
Accurate Recognition of Translation Initiation Sites.
Invited talk at 10th Congress of the Federation of Asian
and Oceanic Biochemists and Molecular Biologists,
Bangalore, India, December 2003.
-
Limsoon Wong.
Accurate Recognition of Translation Initiation Sites, Transcription Start
Sites, and Polyadenylation Signals in Genomic Sequences.
Invited seminar for University of Electronic Science and
Technology of China, Chengdu, Sichuan, 25 May 2007.
Acknowledgements
This project is supported in part by the
I2R-SOC Joint Lab on Knowledge Discovery
from Clinical Data (7/03 - 6/07).
Last updated: 8/3/13, Limsoon Wong.