Mass spectrometry (MS)-based proteomics is a powerful tool for profiling systems-wide protein expression changes. It can be applied for various purposes, e.g., biomarker discovery in diseases and the study of drug responses. However, MS-based proteomics tends to suffer from consistency issues (poor reproducibility and inter-sample agreement) and coverage issues (inability to detect the entire proteome) that urgently need to be addressed. In this project, we aim to deal with these two challenges by proposing approaches that analyze proteomic profiles in the context of biological networks.
We find existing works on gene expression analysis fall short on several issues: these works provide little information on the interplay between selected genes; the collection of pathways that can be used, evaluated, and ranked against the observed expression data is limited; and a comprehensive set of rules for reasoning about relevant molecular events has not been compiled and formalized. We thus envision, in this project, a more advanced integrated framework to provide biologically inspired solutions for these challenges.
A phylogenetic tree is used to study the evolutionary relationships among a set of taxa. Phylogenetic trees have been used in many different areas of biology. In this project, we aim to develop methods for constructing and comparing phylogenetic trees and networks.
DNA sequences, which hold the code of life for every living organism, can be represented as strings over the four characters A, C, G, and T. Owing to advances in biotechnology, the complete sequences of a number of living organisms, including human, are already known. Advances in sequencing also generate large amounts of sequencing data. We face two problems. First, the datasets are too big: it is easy to generate hundreds of billions of bytes of data per individual. Second, for their analyses, biologists require tools that can efficiently locate the positions of an arbitrary pattern in a long DNA sequence. However, genomic sequences are long, and searching for a pattern by linearly scanning the DNA sequence is time-consuming. To cope with the huge amount of data, we can use compression techniques. To solve the pattern-searching problem, we can use indexing data structures. We need to resolve both issues at the same time. In this project, we aim to create indexing data structures that are compressed.
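As an illustration of how an index can replace a linear scan, the sketch below uses backward search over the Burrows-Wheeler transform, the idea behind the FM-index family of compressed indexes. It is only a toy under stated assumptions, not necessarily the data structure developed in this project: the function names are hypothetical, the suffix array is built naively, and the suffix array and occurrence tables are stored in full, whereas a genuinely compressed index would sample or compress them.

```python
from collections import Counter


def build_index(dna):
    """Build a toy BWT-based index for a DNA string (hypothetical helper)."""
    text = dna + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array construction; fine for a short example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (text[-1] is "$").
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def find(pattern, index):
    """Return the sorted start positions of a non-empty pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(pattern):  # backward search, one character at a time
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_index("ACGTACGTGA")
print(find("ACG", index))
```

The search loop runs once per pattern character, so its cost depends on the pattern length rather than on the length of the indexed sequence.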
The process of controlling the expression of genes is known as gene regulation. Gene regulation dictates when, where (in what tissue(s)), and how much of a particular protein is produced. This decides the development of cells and their responses to external stimuli. The most direct control mechanism is transcription regulation, which controls whether the transcription process of a gene should be initiated. In eukaryotic cells, RNA polymerase II is responsible for the transcription process. However, it is incapable of initiating transcription on its own. It does so with the assistance of a number of DNA-binding proteins called transcription factors (TFs). TFs bind the DNA sequence and interact to form a pre-initiation complex (PIC). RNA polymerase II is recruited to the PIC, and then transcription begins. The crucial point of the regulation mechanism is the binding of TFs to DNA. Disruptions in gene regulation are often linked to a failure of TF binding, due either to mutation of the DNA binding site or to mutation of the TF itself. Owing to advances in next-generation sequencing, a number of techniques have been developed for studying gene regulation, including ChIP-seq, ChIA-PET, and Hi-C. In this project, we develop computational tools for analyzing next-generation sequencing data. We also collaborate with biologists to make new biological discoveries.
Plant metabolites are compounds synthesized by plants for essential functions, such as growth and development (primary metabolites, such as lipids), and for specific functions, such as pollinator attraction and defense against herbivores (secondary metabolites). Many of them are also used directly, or as derivatives, to treat a wide range of human diseases. There is thus much interest in studying the biosynthesis of different plant metabolites and improving their yield. In this project, we apply next-generation sequencing techniques to investigate lipid and secondary metabolism, as well as to identify relevant DNA variations, in important crops.
Protein interaction networks resulting from high-throughput assays are still essentially an in vitro scaffold. Further progress in computational analysis techniques and experimental methods is needed to reliably deduce in vivo protein interactions, to distinguish between permanent and transient interactions, to distinguish direct protein binding from membership in the same protein complex, and to distinguish protein complexes from functional modules. In this project, we hope to develop a robust and powerful system that postprocesses the results of high-throughput PPI assays and integrates extensive annotation information, to yield a more informative protein interactome beyond a mere in vitro scaffold.
More and more data have been accumulated and stored in digital format in various applications. These data provide rich sources for making new discoveries. Data mining has become an important tool to transform data into knowledge. Finding useful and actionable knowledge is the main objective of diagnostic data mining. Most existing works tackle the problem by discovering patterns and rules and then studying their interestingness. In this work, we use a different paradigm, which represents the discovered knowledge in the form of hypotheses. A hypothesis involves a comparison of two or more samples, which is similar to how humans obtain knowledge. Compared with patterns and rules, hypotheses provide the context in which a piece of information is interesting; they are thus more intuitive and informative than patterns and rules. More importantly, users can take actions more easily based on what a hypothesis indicates. We further analyse the discovered significant hypotheses and identify the reasons behind them, so that users not only get to know what is happening but also have a rough idea of when or why it is happening. This new data mining paradigm has the potential to make diagnostic data mining as successful as predictive data mining in real-life applications. In the proposed research, we will (1) formulate the problem and identify the issues that need to be addressed; (2) develop algorithms to solve the problem; (3) visualize the discovered knowledge to make the system easy to use; and (4) interact and cooperate with domain experts in the biomedical area or other areas, and use the developed techniques to solve real-life problems.
There is a critical need to address the emergence of drug-resistant varieties of pathogens for several infectious diseases. For example, drug-resistant tuberculosis has continued to spread internationally and is now approaching critical proportions. Approaches to counter drug resistance have so far achieved limited success. It has been proposed that this lack of success is due to a lack of understanding of how resistance emerges in bacteria upon drug treatment, and that a systems-level analysis of the proteins and interactions involved is essential to gaining insights into the routes required for drug resistance. In this project, we propose to deal with the challenges in the systems-level analysis of proteins and interactions in pathogens of infectious diseases for identifying drug resistance pathways. We plan to use M. tuberculosis as a test case.
The human genome harbors millions of common single nucleotide polymorphisms (SNPs) and other types of genetic variations. These variations play an important role in understanding the genetic basis of human diseases and of the body's responses to prescribed drugs. The discovery of such genetic factors contributing to variations in drug response, efficiency, and toxicity has come to be known as pharmacogenomics. In this project, we explore several pharmacogenomics-related applications of database and data mining technologies.
A large-scale epidemic simulation model of Singapore will be constructed in this project by taking demographic, social contact, and geographic factors into consideration. The project objectives include data collection, simulation and modelling, as well as strategy development for epidemic control.
Protein conformational changes play a critical role in vital biological functions. Due to noise in data, determining salient conformational changes accurately and efficiently is a challenging problem. We have developed an efficient algorithm for analyzing conformational changes of a protein, given its structures in two different conformations. A key element of the algorithm is a statistical test that determines the similarity of two protein structures in the presence of noise. Using data from the Protein Data Bank and the Macromolecular Movements Database, we tested the algorithm on proteins that exhibit a range of different conformational changes. The results show that our algorithm can reliably detect salient conformational changes, including well-known examples such as hinge and shear.
Many interesting properties of molecular motion are best characterized statistically by considering an ensemble of motion pathways rather than an individual one. Classic simulation techniques, such as the Monte Carlo method and molecular dynamics, generate individual pathways one at a time and are easily "trapped" in the local minima of the energy landscape. They are computationally inefficient if applied in a brute-force fashion to deal with many pathways. We introduce Stochastic Roadmap Simulation (SRS), a randomized technique for sampling molecular motion and exploring the kinetics of such motion by examining multiple pathways simultaneously.
We are interested in the combinatorial and algorithmic aspects of bioinformatics problems. Our current research includes microarray probe design, pattern searching, genome annotation, protein structure prediction, etc. Some of our results: FindProbe, BAYESPROT, GeneNet, CSA and SAGA Programs, LGSFAligner: A Tool to Align Two RNA Secondary Structures, and Similarity Search Using Spaced Seeds.
Biological databases contain thousands of bio-sequences and an increasing number of 3D structures of molecules. It is hard for a biologist to search for the sequences or structures required for analysis, prediction, reasoning, etc. The data size is so large that naive computer algorithms cannot do this job in practice. We therefore need "smart" algorithms for both exact matching and similarity matching on sequences and 3D structures. Many techniques have been developed by many researchers to solve these problems, but perfect solutions have not yet emerged. Our objective is to develop an optimal solution, at least for some particular bioinformatics applications.
Gene expression data can be a valuable tool for understanding genes, biological networks, and cellular states. One ambitious goal in analyzing expression data is to determine how one particular gene is affected by the expression of other genes, thus deriving the gene network. Gene expression data can also be used to determine which genes are expressed as a result of certain cellular conditions. Such knowledge can help in disease diagnosis. Gene expression data typically contain a large number of columns (genes) and a small number of rows (samples). This high dimensionality presents great difficulties to existing mining algorithms. We are investigating a number of techniques for mining rules and patterns, identifying clusters, and classifying gene expression data.
Computational Systems Biology involves studying cellular functions and their components at varying degrees of granularity. These levels range from nano-scale molecular structures (atomic level) to entire organs such as the heart and lungs (phenotype level). Our research focus is mainly on the functional aspects of cellular components, in the form of biopathways. We are especially interested in modeling and analyzing Signaling Pathways and Gene Regulatory Networks. Currently we have joint projects with the Genome Institute of Singapore and the Department of Biochemistry, NUS, modeling various pathways that are involved in important cell processes such as differentiation and apoptosis. Using these pathways as examples, we hope to develop a set of tools and a modeling methodology that produce accurate models that can be validated and used to predict new phenomena.
There are many previous data mining works on frequent itemsets, their closed patterns, and their generators. There are also a number of studies on emerging patterns and their borders. But the mining of odds ratio patterns, relative risk patterns, and patterns having other statistical properties frequently used in the analysis of biomedical data has never been investigated extensively. These patterns therefore deserve our attention. In this project, we would like to (a) study in depth the theoretical properties of these patterns; (b) develop efficient algorithms for their mining; (c) develop efficient algorithms for their incremental maintenance when the underlying databases are updated; (d) investigate ways to build classifiers based on them and develop techniques for visualizing and explaining decisions made by such classifiers; and (e) apply them to biomedical data.
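For readers unfamiliar with these statistics, the following sketch computes the odds ratio and relative risk of a single itemset pattern from a 2x2 contingency table, using the standard textbook definitions. The function names, the toy records, and the case/control labels are assumptions made for illustration; this is not one of the mining algorithms proposed in the project, which would need to evaluate many patterns efficiently.

```python
def contingency(transactions, labels, pattern):
    """Count the 2x2 table for one pattern.

    a: pattern present, case       b: pattern present, control
    c: pattern absent,  case       d: pattern absent,  control
    """
    pattern = set(pattern)
    a = b = c = d = 0
    for items, is_case in zip(transactions, labels):
        present = pattern <= set(items)  # pattern is a subset of the record
        if present and is_case:
            a += 1
        elif present:
            b += 1
        elif is_case:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a, b, c, d):
    # (a / b) / (c / d); raises ZeroDivisionError if b or c is zero.
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    # Risk of being a case with the pattern divided by risk without it;
    # raises ZeroDivisionError if a row total or c is zero.
    return (a / (a + b)) / (c / (c + d))


# Toy data: four cases followed by four controls (hypothetical items).
records = [{"x", "y"}, {"x", "y"}, {"x", "y", "z"}, {"z"},
           {"x", "y"}, {"z"}, {"x"}, {"y", "z"}]
is_case = [True, True, True, True, False, False, False, False]

table = contingency(records, is_case, {"x", "y"})
print(table, odds_ratio(*table), relative_risk(*table))
```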
In this project, we investigate and develop graph-based methods for inferring protein functions without sequence homology. Most approaches to predicting protein function from protein-protein interaction data utilize the observation that a protein often shares functions with the proteins that interact with it (its level-1 neighbors). However, proteins that interact with the same proteins (i.e., level-2 neighbors) may also have a greater likelihood of sharing similar physical or biochemical characteristics. We are interested in finding out how significant the functional association between level-2 neighbors is and how it can be exploited for protein function prediction. We will also investigate how to integrate protein interaction information with other types of information to improve the sensitivity and specificity of protein function prediction, especially in the absence of sequence homology.
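To make the notions of level-1 and level-2 neighbors concrete, here is a minimal sketch on a toy interaction graph. The protein names, annotations, and the unweighted vote are assumptions for illustration only; the methods investigated in the project may weight neighbors quite differently. In this sketch a level-2 neighbor is any other protein that shares at least one interaction partner with the query, so it may also happen to be a level-1 neighbor.

```python
from collections import Counter, defaultdict


def build_graph(edges):
    """Undirected interaction graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def level1(graph, protein):
    """Proteins that interact directly with the query protein."""
    return set(graph.get(protein, set()))


def level2(graph, protein):
    """Proteins sharing at least one interaction partner with the query."""
    result = set()
    for partner in level1(graph, protein):
        result |= graph[partner]
    result.discard(protein)
    return result


def vote_functions(graph, annotations, protein):
    """Naive, unweighted vote over level-1 and level-2 neighbors (assumed)."""
    votes = Counter()
    for neighbor in level1(graph, protein):
        votes.update(annotations.get(neighbor, ()))
    for neighbor in level2(graph, protein):
        votes.update(annotations.get(neighbor, ()))
    return votes


# Hypothetical toy network and annotations.
ppi = build_graph([("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")])
known = {"B": {"transport"}, "C": {"kinase"}, "D": {"kinase"}}

print(level1(ppi, "A"), level2(ppi, "A"))
print(vote_functions(ppi, known, "A").most_common(1))
```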
Progress in high-throughput experimental techniques in the past decade has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, recent surveys reveal that interaction data obtained by popular high-throughput assays such as yeast two-hybrid experiments may contain as many as 50% false positives and false negatives. As a result, carefully focused small-scale experiments are often needed to complement the large-scale methods and validate the detected interactions. However, the vast interactomes require much more scalable and inexpensive approaches. It would thus be useful if the list of protein-protein interactions detected by such high-throughput assays could be prioritized in some way. In this project, we explore computational techniques for assessing the reliability of protein-protein interactions detected by such high-throughput methods, especially those that rely only on the topological information of the protein interaction network derived from such high-throughput experiments.
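As a rough illustration of prioritizing interactions from topology alone, the sketch below scores each reported interaction by the Jaccard overlap of the two proteins' other interaction partners and ranks the list by that score. The choice of Jaccard overlap and all names here are assumptions for the sake of a small example; the project may use different topological indices.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected interaction graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def shared_partner_score(graph, u, v):
    """Jaccard overlap of the partners of u and v, excluding u and v."""
    partners_u = graph[u] - {v}
    partners_v = graph[v] - {u}
    union = partners_u | partners_v
    if not union:
        return 0.0
    return len(partners_u & partners_v) / len(union)


def rank_interactions(edges):
    """Return (score, u, v) tuples, most topologically supported first."""
    graph = build_graph(edges)
    scored = [(shared_partner_score(graph, u, v), u, v) for u, v in edges]
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical toy interaction list.
reported = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
for score, u, v in rank_interactions(reported):
    print(f"{u}-{v}: {score:.2f}")
```

In this toy network the interaction C-D has no shared partners and ends up at the bottom of the list, which is the kind of candidate one might send for small-scale validation first.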
While the first miRNAs were discovered using experimental methods, experimental miRNA identification remains technically challenging and incomplete. This calls for the development of computational approaches to complement experimental approaches to miRNA gene identification. We propose in this project to investigate de novo miRNA precursor prediction methods. We follow the "feature generation, feature selection, and feature integration" paradigm of constructing recognition models for genomic sequences. We generate and identify features based on information in both primary sequence and secondary structure, and use these features to construct decision models for the recognition of miRNA precursors. In addition, analyzing the binding of miRNAs to their mRNA target sites reveals that many different factors determine what constitutes a good fit. We thus intend to investigate these factors in detail and to construct decision models for predicting miRNA targets. Finally, we would like to understand the role of miRNAs in a number of human diseases. In particular, we plan to begin our analysis with genes involved in muscular dystrophy, as the genes in this group are among the largest and most structurally complex human genes.
The human pregnane X receptor (hPXR) is a nuclear receptor that binds to various ligands, regulating the breakdown of drugs in the human body. To study drug-drug interactions, we investigated a method for predicting potential ligand binding conformations in the binding pocket of hPXR.
This project is related to RNA structure prediction and comparison. There may be more than one structure with the optimum free energy, or there may be many structures within 5% to 10% of the minimum free energy, and these may be topologically very different. Inferring which structure is truly representative of the natural structure requires additional information. When a set of homologous sequences has a certain structure in common, this structure can be deduced by comparing the structures possible for their sequences. The assumption of a common structure is very reasonable, since it is unlikely that a molecule will undergo a total change in structure during its evolution and still be functional. The process of random mutation and selection "tests" a large range of possible sequences, and those that do not have the functional structure necessary for survival are discarded. The resulting model is corroborated by every newly determined sequence that fits the perceived structure.
The development of microarray technology has made possible the simultaneous monitoring of the expression of thousands of genes. This development offers great opportunities in advancing the diagnosis of diseases, the treatment of diseases, and the understanding of gene functions. This project aims to: (1) develop technologies for the design of microarrays; (2) develop tools for the analysis of gene expression profiles, especially for optimization of disease treatment; and (3) apply these tools for optimization of disease treatment, with childhood leukemias as the initial area.
Correct prediction of transcription start sites, translation initiation sites, gene splice sites, poly-A sites, and other functional sites from DNA sequences is an important issue in genomic research. In this project, we investigate these prediction problems using the paradigm of "feature generation, feature selection, and feature integration". There are two reasons for our interest in such a paradigm. The first reason is that standard toolboxes can be identified and used for each of the three components. For example, any statistical significance test can be used for feature selection. Similarly, any machine learning method can be used for feature integration. The main challenge is in developing a "standard" toolbox for feature generation suitable for DNA functional sites. The second reason is that features that are critical to the recognition of specific DNA functional sites are explicitly generated and selected in this paradigm. This explicitness is helpful in understanding the underlying biological mechanism of that DNA functional site.
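The three stages of this paradigm can be sketched in a few lines. Every concrete choice below is an assumption made to keep the example small and dependency-free: k-mer frequencies stand in for feature generation, a difference of class means stands in for a proper statistical significance test, and a nearest-centroid rule stands in for a machine learning method. The sequences are made-up toy data, not real functional sites.

```python
from itertools import product
from statistics import mean


def generate_features(seq, k=2):
    """Feature generation: frequency of every k-mer over A, C, G, T."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    windows = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with other symbols such as N
            counts[kmer] += 1
    return {kmer: n / windows for kmer, n in counts.items()}


def select_features(pos, neg, top=3):
    """Feature selection: keep features whose class means differ most.

    This is a crude stand-in for a statistical significance test.
    """
    score = {f: abs(mean(x[f] for x in pos) - mean(x[f] for x in neg))
             for f in pos[0]}
    return sorted(score, key=lambda f: (-score[f], f))[:top]


def integrate(pos, neg, selected, k=2):
    """Feature integration: nearest-centroid classifier on selected features."""
    centroid_pos = [mean(x[f] for x in pos) for f in selected]
    centroid_neg = [mean(x[f] for x in neg) for f in selected]

    def classify(seq):
        feats = generate_features(seq, k)
        point = [feats[f] for f in selected]
        dist_pos = sum((p - c) ** 2 for p, c in zip(point, centroid_pos))
        dist_neg = sum((p - c) ** 2 for p, c in zip(point, centroid_neg))
        return dist_pos < dist_neg  # True means "looks like a positive site"

    return classify


# Toy training data (hypothetical): GC-rich positives, AT-rich negatives.
positives = [generate_features(s) for s in ["GCGCGC", "GGCGCC", "CGCGCG"]]
negatives = [generate_features(s) for s in ["ATATAT", "AATTAA", "TATATA"]]

chosen = select_features(positives, negatives)
classifier = integrate(positives, negatives, chosen)
print(chosen, classifier("GCGGCG"), classifier("TTATAT"))
```

Because the selected features are explicit k-mers, one can inspect `chosen` directly, which mirrors the point made above about explicitness aiding biological interpretation.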
A large part of the information required for biology research can only be found in free-text form, as in MEDLINE abstracts, or in comment fields of relevant reports, as in GenBank feature table annotations. This information is important for many types of analysis, such as classification of proteins into functional groups, discovery of new functional relationships, maintenance of information on material and methods, extraction of protein interaction information, and so on. However, information in free-text form is very difficult for automated systems to use. The project investigates techniques and applications of natural language processing to the extraction of biological information from free text.
Kleisli is a data transformation and integration system that can be used for any application where the data is typed, and has proven especially useful for bioinformatics applications. It extends the conventional flat relational data model supported by the query language SQL to a complex object data model supported by the collection programming language CPL. It also opens up the closed nature of commercial relational data management systems to an easily extensible system that performs complex transformations on autonomous data sources that are heterogeneous and geographically dispersed.