Dealing with Confounders in Omics Analysis

Participants: Wilson Goh, Wong Limsoon

Overview

Statistical feature selection on high-throughput omics data (e.g., genomics, proteomics, and transcriptomics) is commonly deployed to help understand the mechanisms underpinning disease onset and progression. In clinical practice, these features are critical as biomarkers for diagnosis, guiding treatment, and prognosis. Unlike monogenic disorders, many challenging diseases (e.g., cancer) are polygenic, requiring multigenic signatures to counteract etiology and human variability issues. Unfortunately, in the course of analyzing omics data, we commonly encounter universality and reproducibility problems caused not only by etiology and human variability, but also by batch effects, poor experiment design, inappropriate sample size, and misapplied statistics.

Current literature mostly blames poor experiment design and overreliance on the highly fluctuating P-value. In this project, we rethink more deeply the mechanics of applying statistical tests (e.g., hypothesis statement construction, null distribution appropriateness, and test-statistic construction), and design analysis techniques that are robust on omics data.

Selected Publications

Wilson Wen Bin Goh, Limsoon Wong. Why batch effects matter in omics data, and how to avoid them. Trends in Biotechnology, 35(6):498--507, June 2017.
Wei Zhong Toh, Kwok Pui Choi, Limsoon Wong. Redhyte: A self-diagnosing, self-correcting, and helpful hypothesis analysis platform. Journal of Information and Telecommunication, 1(3):241--258, July 2017. PDF
Limsoon Wong. Big data and a bewildered lay analyst. Statistics and Probability Letters, 136:73--77, May 2018.
Wilson Wen Bin Goh, Limsoon Wong. Dealing with confounders in -omics analysis. Trends in Biotechnology, 36(5):488--498, May 2018.
Wilson Wen Bin Goh, Limsoon Wong. Why breast cancer signatures are no better than random signatures explained. Drug Discovery Today, 23(11):1818--1823, November 2018.
Wilson Wen Bin Goh, Limsoon Wong. Turning straw into gold: Building robustness into gene signature inference. Drug Discovery Today, 24(1):31--36, January 2019.
Wilson Wen Bin Goh, Limsoon Wong. The birth of bio-data science: Trends, expectations, and applications. Genomics, Proteomics, & Bioinformatics, 18(1):5--15, February 2020. PDF
Sung Yang Ho, Limsoon Wong, Wilson Wen Bin Goh. Avoid oversimplifications in machine learning: Going beyond the class-prediction accuracy. Patterns, 1(2):100025, May 2020. PDF
Sung Yang Ho, Sophia Tan, Chun Chau Sze, Limsoon Wong, Wilson Wen Bin Goh. What can Venn diagrams teach us about doing data science better? International Journal of Data Science and Analytics, 11(1):1--10, January 2021. PDF
Yaxing Zhao, Limsoon Wong, Wilson Wen Bin Goh. How to do quantile normalization correctly for gene expression data analysis. Scientific Reports, 10:15534, September 2020. PDF
Sung Yang Ho, Kimberly Phua, Limsoon Wong, Wilson Wen Bin Goh. Extensions of the external validation for checking learned model interpretability and generalizability. Patterns, 1(8):100129, November 2020. PDF
Chern Han Yong, Shawn Hoon, Sanjay De Mel, Stacy Xu, Jonathan Adam Scolnick, Xiaojing Huo, Michael Lovci, Wee Joo Chng, Limsoon Wong. MapBatch: Conservative batch normalization for single cell RNA-sequencing data enables discovery of rare cell populations in a multiple myeloma cohort. Blood, 138(Suppl. 1):2954--2954, November 2021. PDF
Li Rong Wang, Limsoon Wong, Wilson Wen Bin Goh. How Doppelganger effects in biomedical data confound machine learning. Drug Discovery Today, 27(3):678--685, March 2022.
Wilson Wen Bin Goh, Chern Han Yong, Limsoon Wong. Are batch effects still relevant in the age of big data? Trends in Biotechnology, 40(9):1029--1040, September 2022.
Wilson Wen Bin Goh, Reuben Jyong Kiat Foo, Limsoon Wong. What can scatterplots teach us about doing data science better? International Journal of Data Science and Analytics, 17:111--125, 2024. DOI: https://doi.org/10.1007/s41060-022-00362-9.
Wei Xin Chan, Limsoon Wong. Accounting for treatment during the development or validation of prediction models. Journal of Bioinformatics and Computational Biology, 20(6):2271001, December 2022. PDF
Wei Xin Chan, Limsoon Wong. Obstacles to effective model deployment in healthcare. Journal of Bioinformatics and Computational Biology, 21(2):2371001, April 2023. PDF
Wilson Wen Bin Goh, Harvard Wai Hann Hui, Limsoon Wong. How missing value imputation is confounded with batch effects and what you can do about it. Drug Discovery Today, 28(9):103661, September 2023. PDF
Lakshmi Alagappan, Jia En Chu, Joanna Huixin Chua, Jia Wen Ding, Ronghui Xiao, Zhe Yu, Kun Pan, Untzizu Elejalde, Kevin Junliang Lim, Limsoon Wong. Class-specific correction and classification of NIR spectra of edible oils. Chemometrics and Intelligent Laboratory Systems, 241:104977, September 2023. PDF
Wilson Wen Bin Goh, Mohammad Neamul Kabir, Sehwan Yoo, Limsoon Wong. Ten quick tips for ensuring machine learning model validity. PLoS Computational Biology, 20(9):e1012402, September 2024. PDF
Xizi Luo, Andre Huikai Lin, Song Yi Amadeus Chi, Limsoon Wong, Chowdhury Rafeed Rahman. Benchmarking recent computational tools for DNA-binding protein identification. Briefings in Bioinformatics, 26(1):bbae634, January 2025. PDF
Chowdhury Rafeed Rahman, Anders Jacobsen Skanderup, Zhong Wee Poh, Limsoon Wong. GCfix: A fast and accurate fragment length-specific method for correcting GC bias in cell-free DNA. Bioinformatics, in press.

Selected Presentations

Limsoon Wong. Some issues that are often overlooked in big data analytics. Invited talk at University of Malaya Symposium on Data Science, Kuala Lumpur, Malaysia, 25 October 2016. PPT
Limsoon Wong. A logician-engineer's adventures in data science and analytics. Invited keynote at International Conference on Intelligent Computing, Instrumentation & Control Technologies, Vimal Jyothi Engineering College, Kannur, Kerala, India, 6 - 7 July 2017. PPT
Limsoon Wong. Anna Karenina and the careless null hypothesis in omics data analysis. Invited talk at IPM Workshop on Systems Biology, Institute for Research in Fundamental Sciences, Tehran, Iran, 2 - 3 August 2017. PPT
Limsoon Wong. Some simple tactics for deriving a deeper analysis of data. Keynote at 4th International Conference on Computational Science and Technology, Kuala Lumpur, Malaysia, 29 - 30 November 2017. PPT
Limsoon Wong. Big data and a bewildered lay analyst. Talk at BUET-NUS Computer Science Workshop, Dhaka, Bangladesh, 2 March 2018. PPT
Limsoon Wong. From bewilderment to enlightenment: Logic in cancer research. Plenary talk at 5th NCIS Annual Research Meeting (NCAM2018), National University Hospital, Singapore, 3 August 2018. PPT
Limsoon Wong. Dealing with confounders in omics data analysis. Plenary talk at 14th International Conference on Intelligent Computing, Wuhan, China, 15-18 August 2018.
Limsoon Wong. From bewilderment to enlightenment: Logic in cancer research. Keynote address at International Multi-Conference on Engineering and Technology Innovation (IMETI 2018), Taoyuan, Taiwan, 2-6 November 2018. PPT
Limsoon Wong. Dealing with confounders in omics data analysis. Invited talk at 9th International Conference on Computational Systems Biology and Bioinformatics (CSBio2018), Bangkok, Thailand, December 2018. PPT
Limsoon Wong. From bewilderment to enlightenment in cancer research... hopefully. Invited talk at GeCo Workshop on Challenges in Data-Driven Genomic Computing, Villa del Grumello, Como, Italy, 6-8 March 2019. PPT
Limsoon Wong. Some opinion and advice on machine learning in population-based genomic medicine. Invited talk at NUS-Cambridge Joint Research Symposium on Population-based Genomic Medicine, National University of Singapore, 24-25 October 2019. PPT
Limsoon Wong. Conservative batch-effect correction for single-cell RNA-seq data enables discovery of rare cell populations. Invited keynote at 1st International and 10th National Iranian Conference on Bioinformatics (ICB10), Kish Island, Iran, 22-24 February 2022.
Limsoon Wong. Some bad practices in data analysis and machine learning. Invited keynote at 9th IEEE International Conference on Data Science and Advanced Analytics (DSAA2022), Shenzhen, China, 13-16 October 2022. PPT
Limsoon Wong. Single-cell RNA-seq dataset integration without loss of unique rare cell populations. Invited talk at 5th China-ASEAN Forum on Health Cooperation, Nanning, Guangxi, China, 26 - 27 May 2023. MP4
Limsoon Wong. The hidden truths of principal component analysis. Distinguished lecture at Hong Kong Baptist University, 18 October 2024.
Limsoon Wong. Single-cell RNA-seq dataset integration without loss of unique rare cell populations. Invited keynote at International Conference on Quantum & AI Technologies in Biomedical Science, National Taiwan University, Taipei, Taiwan, 7-9 April 2025.

Acknowledgements

This project is supported in part by a Kwan Im Thong Hood Cho Temple Chair Professorship, and in part by two AI Singapore grants (AISG-100E-2019-027 and AISG-100E-2019-028) and a Singapore Ministry of Education tier-2 grant (MOE2019-T2-1-042).

Last updated: 23 June 2025, Limsoon Wong.