We are interested in the combinatorial and algorithmic aspects of bioinformatics problems. Our current research includes microarray probe design, pattern searching, genome annotation, and protein structure prediction, among other topics. Some of our results: FindProbe, BAYESPROT, GeneNet, CSA and SAGA Programs, LGSFAligner: A Tool to Align Two RNA Secondary Structures, and Similarity Search Using Spaced Seeds.
Biological databases contain thousands of biological sequences and a growing number of 3D molecular structures. It is hard for a biologist to search them for the sequences or structures needed for analysis, prediction, reasoning, and so on. The data are so large that naive computer algorithms are impractical for this job. We therefore need "smart" algorithms for both exact matching and similarity matching on sequences and 3D structures. Many techniques have been developed to solve these problems, but no perfect solution has emerged yet. Our objective is to develop an optimal solution, at least for some particular bioinformatics applications.
Gene expression data can be a valuable tool for understanding genes, biological networks, and cellular states. One ambitious goal in analyzing expression data is to determine how one particular gene is affected by the expression of other genes, and thus to derive the gene network. Gene expression data can also be used to determine which genes are expressed as a result of certain cellular conditions. Such knowledge can help in disease diagnosis. Gene expression data typically contain a large number of columns (genes) and a small number of rows (samples). This high dimensionality presents great difficulties to existing mining algorithms. We are investigating a number of techniques for mining rules and patterns, identifying clusters, and classifying gene expression data.
There is a large body of previous data mining work on frequent itemsets, their closed patterns, and their generators. There are also a number of studies on emerging patterns and their borders. However, the mining of odds ratio patterns, relative risk patterns, and patterns with other statistical properties frequently used in the analysis of biomedical data has never been investigated extensively. Such patterns therefore deserve our attention. In this project, we would like to (a) study in depth the theoretical properties of these patterns, (b) develop efficient algorithms for mining them, (c) develop efficient algorithms for their incremental maintenance when the underlying databases are updated, (d) investigate ways to build classifiers based on them and develop techniques for visualizing and explaining decisions made by such classifiers, and (e) apply them to biomedical data.
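To make the two statistics concrete, the sketch below computes the odds ratio and relative risk of a single itemset pattern over a two-class dataset. It is only an illustration under stated assumptions: each sample is a set of items, the pattern's presence is treated as the exposure and membership in `cases` as the outcome, and the function names and toy data are hypothetical rather than part of any released tool. It does not address the efficient mining or incremental maintenance that the project targets.

```python
def contingency(pattern, cases, controls):
    """Return (a, b, c, d) for a pattern over two classes of samples.

    a: cases containing the pattern      b: controls containing the pattern
    c: cases lacking the pattern         d: controls lacking the pattern
    """
    pattern = frozenset(pattern)
    a = sum(pattern <= set(sample) for sample in cases)
    b = sum(pattern <= set(sample) for sample in controls)
    return a, b, len(cases) - a, len(controls) - b


def odds_ratio(a, b, c, d):
    """Odds of being a case with the pattern over the odds without it: (a*d)/(b*c)."""
    if b * c == 0:
        # Undefined when all four products vanish, unbounded otherwise.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of being a case with the pattern, a/(a+b), over the risk without it, c/(c+d)."""
    if a + b == 0 or c + d == 0:
        return float("nan")
    if c == 0:
        return float("inf") if a > 0 else float("nan")
    return (a / (a + b)) / (c / (c + d))


# Hypothetical toy data: each sample is the set of items observed in it.
cases = [{"x", "y"}, {"x", "y", "z"}, {"x"}, {"z"}]
controls = [{"x", "y"}, {"z"}, {"y"}, {"z", "x"}]

table = contingency({"x", "y"}, cases, controls)
print(table, odds_ratio(*table), relative_risk(*table))
```

A pattern miner would evaluate these statistics for many candidate itemsets and keep those above a chosen threshold; the thresholds and the search strategy are left open here.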
In this project, we investigate and develop graph-based methods for inferring protein functions without sequence homology. Most approaches to predicting protein function from protein-protein interaction data rely on the observation that a protein often shares functions with the proteins that interact with it (its level-1 neighbors). However, proteins that interact with the same proteins (i.e. level-2 neighbors) may also have a greater likelihood of sharing similar physical or biochemical characteristics. We are interested in finding out how significant the functional association between level-2 neighbors is and how it can be exploited for protein function prediction. We will also investigate how to integrate protein interaction information with other types of information to improve the sensitivity and specificity of protein function prediction, especially in the absence of sequence homology.
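The notions of level-1 and level-2 neighbors can be sketched on a small undirected interaction graph as follows. This is a minimal illustration, not the project's prediction method: the protein names, annotations, and the plain vote count are hypothetical, and a level-2 neighbor is taken to be any other protein that shares at least one interaction partner with the query protein.

```python
from collections import Counter, defaultdict


def build_graph(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for u, v in interactions:
        if u != v:  # ignore self-interactions in this sketch
            graph[u].add(v)
            graph[v].add(u)
    return graph


def level1_neighbors(graph, protein):
    """Proteins that interact directly with `protein`."""
    return set(graph.get(protein, ()))


def level2_neighbors(graph, protein):
    """Proteins that share at least one interaction partner with `protein`."""
    result = set()
    for partner in graph.get(protein, ()):
        result |= graph[partner]
    result.discard(protein)
    return result


def vote_functions(neighbors, annotations):
    """Count how often each known function occurs among the given neighbors."""
    votes = Counter()
    for protein in neighbors:
        votes.update(annotations.get(protein, ()))
    return votes


# Hypothetical interaction list and functional annotations.
interactions = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("C", "E")]
annotations = {"B": {"transport"}, "C": {"kinase"}, "E": {"kinase", "transport"}}

graph = build_graph(interactions)
print(level1_neighbors(graph, "A"))
print(level2_neighbors(graph, "A"))
print(vote_functions(level2_neighbors(graph, "A"), annotations))
```

Note that under this definition a protein can be both a level-1 and a level-2 neighbor; whether to treat such proteins separately is a design choice the sketch leaves open.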
Progress in high-throughput experimental techniques in the past decade has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, recent surveys reveal that interaction data obtained by popular high-throughput assays such as yeast two-hybrid experiments may contain as many as 50% false positives and false negatives. As a result, carefully focused small-scale experiments are often needed to complement the large-scale methods and validate the detected interactions. The vast interactomes, however, require much more scalable and inexpensive approaches. It would therefore be useful if the list of protein-protein interactions detected by such high-throughput assays could be prioritized in some way. This project explores advances in computational techniques for assessing the reliability of protein-protein interactions detected by high-throughput methods, especially techniques that rely only on the topological information of the protein interaction network derived from such experiments.
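As one example of what a topology-only prioritization might look like, the sketch below ranks interactions by the Jaccard overlap of the two proteins' other interaction partners. This particular index is chosen purely for illustration and is an assumption, not necessarily a measure used in this project; the function names and the toy network are hypothetical.

```python
from collections import defaultdict


def build_graph(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for u, v in interactions:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def shared_partner_score(graph, u, v):
    """Jaccard overlap of the partners of u and v, excluding u and v themselves."""
    partners_u = graph[u] - {v}
    partners_v = graph[v] - {u}
    union = partners_u | partners_v
    if not union:
        return 0.0
    return len(partners_u & partners_v) / len(union)


def rank_interactions(interactions):
    """Return (score, u, v) triples, highest score first; ties keep input order."""
    graph = build_graph(interactions)
    scored = [(shared_partner_score(graph, u, v), u, v) for u, v in interactions]
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical network: A, B, and C form a triangle; D hangs off C.
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
for score, u, v in rank_interactions(interactions):
    print(f"{u}-{v}: {score:.2f}")
```

The intuition is that an interaction whose two proteins share many partners is better supported by the surrounding network than an isolated one, so low-scoring pairs such as C-D would be the first candidates for experimental re-validation.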
While the first miRNAs were discovered using experimental methods, experimental miRNA identification remains technically challenging and incomplete. This calls for the development of computational approaches to complement experimental approaches to miRNA gene identification. In this project, we propose to investigate de novo miRNA precursor prediction methods. We follow the "generation, feature selection, and feature integration" paradigm of constructing recognition models for genomic sequences. We generate and identify features based on information in both primary sequence and secondary structure, and use these features to construct decision models for the recognition of miRNA precursors. In addition, analysis of the binding of miRNAs to their mRNA target sites reveals that many different factors determine what constitutes a good fit. We thus intend to investigate these factors in detail and to construct decision models for predicting miRNA targets. Finally, we would like to understand the role of miRNAs in a number of human diseases. In particular, we plan to begin our analysis with genes involved in muscular dystrophy, as this group of genes is among the largest and most structurally complex in the human genome.
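The feature generation step for precursor candidates can be illustrated with a few simple features drawn from the primary sequence and the secondary structure. The sketch below is hypothetical: it assumes the structure is already available in dot-bracket notation (for example, from an RNA folding program), and the chosen features and names are illustrative only, not the feature set used in this project.

```python
def precursor_features(sequence, structure):
    """Compute a few sequence and structure features for a candidate hairpin.

    `structure` is assumed to be a dot-bracket string of the same length as
    `sequence`, where '(' and ')' mark paired bases and '.' marks unpaired ones.
    """
    if not sequence or len(sequence) != len(structure):
        raise ValueError("sequence and structure must be non-empty and of equal length")
    if structure.count("(") != structure.count(")"):
        raise ValueError("unbalanced dot-bracket structure")

    sequence = sequence.upper()
    length = len(sequence)
    base_pairs = structure.count("(")
    return {
        "length": length,
        # Primary-sequence feature: fraction of G and C nucleotides.
        "gc_content": sum(base in "GC" for base in sequence) / length,
        # Secondary-structure features: number of pairs and paired fraction.
        "base_pairs": base_pairs,
        "paired_fraction": 2 * base_pairs / length,
    }


# Hypothetical toy hairpin: a three-pair stem closing a three-base loop.
features = precursor_features("GGGAAACCC", "(((...)))")
print(features)
```

In the paradigm described above, a larger pool of such features would then be filtered by feature selection and combined by a classifier in the feature integration step.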