Below are the materials for the five sessions led by Wong Limsoon in January and February 2023.
Golden Thread of Science
I will present some of my favourite invariant-based problem-solving
principles. They are useful for many types of problems, even across
different disciplines; I will illustrate them with examples from
computer science, medicine, biology, and biotechnology.
These principles are simple logical ways to exploit fundamental
properties of each problem domain, highlighting the value of both
logical thought and domain knowledge, and bringing out creative ways
of applying the former to the latter in the context of each problem
being solved.
Art of Principal Component Analysis
PCA is quite often thought of as a means for feature selection
whereby minor PCs are discarded and major PCs are kept. However,
this mechanical way of using PCA is often less than optimal and
sometimes even misguided. Here I present some interesting ways
to think about and use PCA.
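As one illustration of why keeping only the major PCs can mislead, here is a minimal sketch on synthetic data (not taken from the session materials). It assumes a two-class dataset in which one feature has large variance unrelated to class and another has small variance but separates the classes. The helper names `pca_scores` and `class_separation` and all numeric settings are hypothetical choices made for this example.

```python
import numpy as np

def pca_scores(X):
    """Return PC scores and explained-variance ratios via SVD of centered data."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                      # projections of samples onto the PCs
    var = S ** 2 / (X.shape[0] - 1)     # variance captured by each PC
    return scores, var / var.sum()

def class_separation(scores, labels):
    """Absolute correlation between each PC's scores and a binary label."""
    y = labels - labels.mean()
    sep = []
    for j in range(scores.shape[1]):
        s = scores[:, j]                # scores are already zero-mean
        sep.append(abs(s @ y) / (np.linalg.norm(s) * np.linalg.norm(y)))
    return np.array(sep)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)
nuisance = rng.normal(0, 10, n)                                   # high variance, no class signal
signal = rng.normal(0, 0.3, n) + np.where(labels == 1, 1.0, -1.0)  # low variance, class signal
X = np.column_stack([nuisance, signal])

scores, ratio = pca_scores(X)
sep = class_separation(scores, labels)

print("explained variance ratio:", np.round(ratio, 3))
print("class separation per PC: ", np.round(sep, 3))
```

In this setup the first PC accounts for almost all of the variance yet carries little class information, while the second PC, which a variance-based cutoff would discard, is the one that separates the classes.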
Anna Karenina Principle
The Anna Karenina Principle is a manifestation of the
theory–practice gap that exists when theoretical statistics
is applied to real-world data. It derives from the situation
where the null hypothesis is rejected for extraneous reasons
(or confounders), rather than because the alternative hypothesis
is relevant to the disease phenotype. The mechanics of applying
statistical tests therefore must address and resolve confounders.
It is inadequate to simply rely on manipulating the P-value.
Indeed, I will show how/why this can be the wrong thing to do.
I will discuss some mechanistic elements with real-life examples,
and suggest how they can be logically designed to foil the
Anna Karenina Principle.
Twilight Zone of Protein Function Prediction
Generally, if the sequences of two proteins are quite similar, the proteins
likely share a common ancestor and have inherited their function
from that ancestor. Thus, if one knows the function of one of
these two proteins, one can infer the function of the other
protein. However, at sequence similarity below 40%, this way of inferring
protein function is accompanied by an explosion of false positives.
Deep learning methods have been proposed as a solution. Do these
approaches work?
Batch Effects in Omics Data
This session looks at a major issue that underlies many omics datasets,
viz. batch effects. Batch effects are technical biases that may
confound analysis of omics data. They are very complex, and effective
mitigation is highly context-dependent. Do they affect identification
of discriminating/causal factors when we analyze patient datasets?
Do prediction models (constructed on training datasets) work well
on future patients? How do you mitigate batch effects?
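To make the idea of a batch effect concrete, here is a minimal sketch on synthetic data (not taken from the session materials). It assumes the simplest possible case: a purely additive shift in one batch, removed by per-batch mean centering. The function name `center_by_batch` and the shift size are hypothetical. This correction is only reasonable when the batches have similar class composition; if batch and class are confounded, centering would also remove real biological signal.

```python
import numpy as np

def center_by_batch(X, batches):
    """Subtract each batch's per-feature mean from the samples in that batch."""
    Xc = X.astype(float).copy()
    for b in np.unique(batches):
        mask = batches == b
        Xc[mask] -= Xc[mask].mean(axis=0)
    return Xc

def max_batch_gap(X, batches):
    """Largest absolute difference in per-feature means between batch 0 and batch 1."""
    return np.abs(X[batches == 0].mean(axis=0) - X[batches == 1].mean(axis=0)).max()

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(1)
n_per_batch, n_features = 50, 5
batches = np.repeat([0, 1], n_per_batch)
X = rng.normal(0, 1, (2 * n_per_batch, n_features))
X[batches == 1] += 3.0          # hypothetical additive batch shift

gap_before = max_batch_gap(X, batches)
X_centered = center_by_batch(X, batches)
gap_after = max_batch_gap(X_centered, batches)

print("max batch gap before centering:", round(float(gap_before), 3))
print("max batch gap after centering: ", round(float(gap_after), 3))
```

Real batch effects are rarely this simple, which is why the choice of mitigation depends so heavily on context.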