Below are materials for the 4 sessions led by Wong Limsoon in January 2025.
Protein function prediction and some lessons for classifier performance evaluation
Generally, if the sequences of two proteins are quite similar, the proteins likely share a common ancestor and have inherited their function from that ancestor. Thus, if one knows the function of one of these two proteins, one can infer the function of the other. However, at sequence similarity below 40%, this way of inferring protein function is accompanied by an explosion of false positives. Deep learning methods have been proposed as a solution. Do these approaches work? In this session, we discuss one of these methods, DeepFam [Seo et al., "DeepFam: Deep learning based alignment-free method for protein family modeling and prediction", Bioinformatics, 34(13):i254-i262, 2018]. Along with this assessment, we also discuss some nuances in classifier performance evaluation that are often overlooked and can lead to disappointment when a classifier that was evaluated as high performing is deployed.
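As one illustration of how a well-evaluated classifier can disappoint after deployment, the sketch below computes the precision expected at a given proportion of true positives in the data. This is an illustrative assumption about one such nuance, not necessarily the one emphasized in the session; the function name and the numbers are hypothetical.

```python
def expected_precision(sensitivity, specificity, positive_fraction):
    """Precision expected when a classifier with the given sensitivity and
    specificity is applied to data in which `positive_fraction` of the
    samples are truly positive (hypothetical helper, for illustration)."""
    true_pos = sensitivity * positive_fraction
    false_pos = (1.0 - specificity) * (1.0 - positive_fraction)
    return true_pos / (true_pos + false_pos)


# Same classifier, two class proportions (made-up numbers).
balanced = expected_precision(0.9, 0.9, 0.5)    # balanced test set
skewed = expected_precision(0.9, 0.9, 0.01)     # positives are rare in deployment
print(f"precision at 50% positives: {balanced:.3f}")
print(f"precision at  1% positives: {skewed:.3f}")
```

Sensitivity and specificity are unchanged between the two calls; only the class proportion differs, yet the precision drops sharply when positives are rare.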
Homework, due 10/1/2025. Read
[Yu et al., "Accurate prediction and key protein sequence feature
identification of cyclins", Briefings in Functional Genomics,
22:411-419, 2023].
Write a 1-page review report focusing on the way it evaluated the
performance of the proposed cyclin classifier.
Be prepared to make a 5-10 minute presentation to the class on 13/1/2025.
Protein function prediction and some lessons for classifier performance evaluation, continued
In this session, we discuss the paper [Kabir & Wong, "EnsembleFam: Towards more accurate protein family prediction in the twilight zone", BMC Bioinformatics, 23:90, 2022]. This is an interesting paper that uses the concept of "similarity of dissimilarities" to predict protein function for twilight-zone proteins, which are proteins with extremely low levels of similarity to reference proteins.
Students are also given the opportunity to present their reports
on their homework from the previous week, viz. [Yu et al., 2023].
Thereby, I hope to help students deepen their understanding of some
additional nuances of classifier performance evaluation.
Gene expression analysis and some lessons for statistical hypothesis testing
Gene expression profiling data is a powerful source of information for understanding biological systems. One of its compelling applications is the identification of differentially expressed genes (DEGs) as biomarkers for disease diagnosis, prognosis, and treatment response. However, DEG selection has been plagued by replicability issues. Many pathway-based methods have been proposed to address this problem. In this session, we discuss the popular overlap-enrichment approach, exemplified by Onto-Express [Draghici et al., "Global functional profiling of gene expression", Genomics, 81(2):98-104, 2003]. In the process, I also expose students to the theory-practice gap that exists when theoretical statistics is applied to real-world data.
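For orientation, overlap enrichment is commonly scored with a one-sided hypergeometric test on the number of DEGs that fall in a pathway. The sketch below is a minimal version under that assumption; it is not claimed to be the exact statistical model used by Onto-Express, and the function name and gene counts are hypothetical.

```python
from math import comb


def overlap_enrichment_pvalue(n_genes, n_pathway, n_deg, n_overlap):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `n_overlap` pathway genes among `n_deg` genes drawn at random, without
    replacement, from `n_genes` genes of which `n_pathway` are in the pathway."""
    max_overlap = min(n_pathway, n_deg)
    total = comb(n_genes, n_deg)
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_deg - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Made-up example: 20000 genes measured, 50 in the pathway,
# 200 selected as DEGs, 5 of which are in the pathway.
p = overlap_enrichment_pvalue(20000, 50, 200, 5)
print(f"p-value = {p:.3g}")
```

The null distribution here assumes that the DEGs are drawn independently and uniformly from all measured genes, an assumption that real expression data may not satisfy.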
Homework, due 24/1/2025. Read
[Srihari et al., "Inferring synthetic lethal interactions from
mutual exclusivity of genetic events in cancer",
Biology Direct, 10:57, 2015].
Write a 1-page review report focusing on the way it tests for
synthetic-lethal gene pairs. Discuss whether their test is a good one.
Be prepared to make a 5-10 minute presentation to the class on 27/1/2025.
Gene expression analysis and some lessons for statistical hypothesis testing, continued
In this session, we discuss the paper [Lim et al., "A quantum leap in the reproducibility, precision, and sensitivity of gene expression profile analysis even when sample size is extremely small", Journal of Bioinformatics and Computational Biology, 13(4):1550018, 2015]. This paper introduces an interesting pathway-based approach that selects DEGs reliably even when sample size is small. It also advocates several techniques for demonstrating replicability.
Students are also given the opportunity to present their homework from
the previous week, viz. [Srihari et al., 2015]. Thereby, I hope to
help students deepen their understanding of some additional nuances
of statistical hypothesis testing; in particular, highlighting the
subtleties involving null distributions.