BS6213 - The Reflective Scientist

Below are materials for the 5 sessions led by Wong Limsoon in Jan & Feb 2023

Brief description of the course and assessment
Session #1, 7 Jan 2023, PPT
Golden Thread of Science
I will present some of my favourite invariant-based problem-solving principles. They are useful for many types of problems across disciplines; I will illustrate them with examples from computer science, medicine, biology, and biotechnology. These principles are simple logical ways to exploit the fundamental properties of each problem domain. They highlight the value of both logical thought and domain knowledge, and bring out creative ways of applying the former to the latter in the context of each problem being solved.
Session #2, 9 Jan 2023, PPT
Art of Principal Component Analysis
PCA is often thought of as a means of feature selection in which minor principal components (PCs) are discarded and major PCs are kept. However, this mechanical way of using PCA is often less than optimal and sometimes even misguided. Here I present some interesting ways to think about and use PCA.
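To make the caution about discarding minor PCs concrete, here is a minimal sketch on synthetic 2-D data. It is an illustrative assumption, not taken from the session slides: the first feature has large variance unrelated to the groups, while the second has small variance but separates them. All names (`principal_axes_2d`, `separation`, `group_a`, `group_b`) and the simulated numbers are hypothetical.

```python
import math
import random
from statistics import mean, variance

def principal_axes_2d(points):
    """Return ((var1, axis1), (var2, axis2)) for 2-D points, largest variance first."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx, syy = variance(xs), variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    axis1 = (math.cos(theta), math.sin(theta))
    axis2 = (-math.sin(theta), math.cos(theta))
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    return (half_trace + radius, axis1), (half_trace - radius, axis2)

def project(points, axis):
    """Scores along an axis (centering is skipped; a shift does not change separation)."""
    return [p[0] * axis[0] + p[1] * axis[1] for p in points]

def separation(scores_a, scores_b):
    """Absolute difference in group means, in units of the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(scores_a) + variance(scores_b)) / 2)
    return abs(mean(scores_a) - mean(scores_b)) / pooled_sd

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Feature 1: high variance, same in both groups. Feature 2: low variance, group-specific.
group_a = [(rng.gauss(0, 10), rng.gauss(-1, 0.3)) for _ in range(200)]
group_b = [(rng.gauss(0, 10), rng.gauss(+1, 0.3)) for _ in range(200)]

(var1, pc1), (var2, pc2) = principal_axes_2d(group_a + group_b)
sep_pc1 = separation(project(group_a, pc1), project(group_b, pc1))
sep_pc2 = separation(project(group_a, pc2), project(group_b, pc2))

print(f"PC1 variance {var1:.2f}, group separation {sep_pc1:.2f}")
print(f"PC2 variance {var2:.2f}, group separation {sep_pc2:.2f}")
```

In this constructed case the minor PC carries nearly all of the group signal, so keeping only the major PC would discard the information of interest.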
Session #3, 16 Jan 2023, PPT
Anna Karenina Principle
The Anna Karenina Principle is a manifestation of the theory–practice gap that exists when theoretical statistics is applied to real-world data. It derives from the situation where the null hypothesis is rejected for extraneous reasons (or confounders), rather than because the alternative hypothesis is relevant to the disease phenotype. The mechanics of applying statistical tests must therefore address and resolve confounders. It is inadequate to simply rely on manipulating the P-value. Indeed, I will show how and why this can be the wrong thing to do. I will discuss some mechanistic elements with real-life examples, and suggest how they can be logically designed to foil the Anna Karenina Principle.
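The effect of a confounder on a test statistic can be sketched with a small constructed dataset. This is a generic illustration of stratifying by a confounder, not the specific method presented in the session. The marker, the age groups, and all values are hypothetical: the marker depends on age group only, while disease is more common in the older group.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

noise = [-0.2, -0.1, 0.0, 0.1, 0.2]
# Hypothetical marker whose level depends on age group only, not on disease status.
young = [1.0 + d for d in noise]
old = [3.0 + d for d in noise]

# Disease is more common in the old group, so age acts as a confounder.
cases = {"young": young * 1, "old": old * 4}
controls = {"young": young * 4, "old": old * 1}

# Pooled comparison ignores age and picks up the age effect.
pooled_t = welch_t(cases["young"] + cases["old"],
                   controls["young"] + controls["old"])

# Stratified comparison holds age fixed within each test.
stratum_t = {age: welch_t(cases[age], controls[age]) for age in ("young", "old")}

print(f"pooled t = {pooled_t:.2f}")
for age, t in stratum_t.items():
    print(f"{age} stratum t = {t:.2f}")
```

The pooled statistic is large even though the marker is unrelated to disease within either age group, which is the kind of rejection "for extraneous reasons" described above.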
Session #4, 30 Jan 2023, PPT
Twilight Zone of Protein Function Prediction
Generally, if the sequences of two proteins are quite similar, the proteins likely share a common ancestor and have inherited their function from that ancestor. Thus, if one knows the function of one of these two proteins, one can infer the function of the other. However, at sequence similarity below 40%, this way of inferring protein function is accompanied by an explosion of false positives. Deep learning methods have been proposed as a solution. Do these approaches work?
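As a minimal sketch of the similarity-based reasoning above, the following computes percent identity over an existing pairwise alignment and flags pairs that fall below the 40% mark. Percent identity is used here as a simple stand-in for sequence similarity. The function names, the toy sequences, and the choice of denominator (aligned columns without gaps) are illustrative assumptions, not part of the course materials.

```python
TWILIGHT_THRESHOLD = 40.0  # percent; the cut-off quoted in the session description

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

def transfer_is_risky(aligned_a, aligned_b):
    """True when the pair falls below the threshold for function transfer."""
    return percent_identity(aligned_a, aligned_b) < TWILIGHT_THRESHOLD

# Toy, already-aligned sequences (hypothetical, for illustration only).
close_pair = ("MKTAYIAKQR", "MKTAYIAKQK")
distant_pair = ("MKTAYIAKQR", "MKS-WLGRQE")

for name, pair in (("close", close_pair), ("distant", distant_pair)):
    print(name, round(percent_identity(*pair), 1), transfer_is_risky(*pair))
```

Definitions of percent identity differ in their denominator, so a threshold such as 40% should be read together with the definition used.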
Session #5, 6 Feb 2023, PPT
Batch effects in omics data
The session looks at a major issue that underlies many omics datasets, viz. batch effects. Batch effects are technical biases that may confound analysis of omics data. They are very complex, and effective mitigation is highly context-dependent. Do they affect identification of discriminating/causal factors when we analyze patient datasets? Do prediction models (constructed on training datasets) work well on future patients? How do you mitigate batch effects?
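One way to see why mitigation is context-dependent is a toy example using per-batch mean-centering, a deliberately simple correction chosen here for illustration rather than the approach taught in the session. The data are constructed, not real: a true case/control difference of 1 and an additive batch offset of 5. All names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def center_by_batch(samples):
    """samples: list of (batch, label, value). Subtract each batch's mean value."""
    by_batch = defaultdict(list)
    for batch, _, value in samples:
        by_batch[batch].append(value)
    batch_mean = {b: mean(values) for b, values in by_batch.items()}
    return [(b, label, value - batch_mean[b]) for b, label, value in samples]

def class_gap(samples):
    """Mean of cases minus mean of controls."""
    cases = [v for _, label, v in samples if label == "case"]
    controls = [v for _, label, v in samples if label == "control"]
    return mean(cases) - mean(controls)

# Balanced design: both batches contain cases and controls; batch B is shifted by +5.
balanced = (
    [("A", "control", v) for v in (10.0, 11.0, 9.0)]
    + [("A", "case", v) for v in (11.0, 12.0, 10.0)]
    + [("B", "control", v) for v in (15.0, 16.0, 14.0)]
    + [("B", "case", v) for v in (16.0, 17.0, 15.0)]
)

# Confounded design: all controls in batch A, all cases in batch B.
confounded = (
    [("A", "control", v) for v in (10.0, 11.0, 9.0)]
    + [("B", "case", v) for v in (16.0, 17.0, 15.0)]
)

for name, data in (("balanced", balanced), ("confounded", confounded)):
    before = class_gap(data)
    after = class_gap(center_by_batch(data))
    print(f"{name}: class gap before = {before}, after centering = {after}")
```

In the balanced design, centering removes the batch offset and leaves the class difference of 1 intact. In the confounded design, the uncorrected gap of 6 mixes the technical and biological differences, and centering erases both, so the same correction that helps in one setting is harmful in the other.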