Exploratory Hypothesis Testing and Analysis

Participants: Andre Suchitra, Haojun Zhang, Wei Zhong Toh, Mengling Feng, Guimei Liu, Limsoon Wong.

Click here for a non-technical ppt of the project.

Background

More and more data are being accumulated and stored in digital form across many applications. These data are rich sources for new discoveries, and data mining has become an important tool for transforming data into knowledge. Finding useful and actionable knowledge is the main objective of diagnostic data mining. Most existing works tackle the problem by discovering patterns and rules and then studying their interestingness; see our past project on pattern spaces. In this work, we use a different paradigm, which represents the discovered knowledge in the form of hypotheses. A hypothesis involves a comparison of two or more samples, which is similar to how humans obtain knowledge. Compared with patterns and rules, hypotheses provide the context in which a piece of information is interesting, so they are more intuitive and informative. More importantly, users can act more easily on what a hypothesis indicates. We further analyse the discovered significant hypotheses and identify the reasons behind them, so that users not only learn what is happening but also get a rough idea of when or why it is happening. This new data mining paradigm has the potential to make diagnostic data mining as successful as predictive data mining in real-life applications. In the proposed research, we will (1) formulate the problem and identify the issues that need to be addressed; (2) develop algorithms to solve the problem; (3) visualize the discovered knowledge to make the system easy to use; and (4) interact and cooperate with domain experts in the biomedical area or other areas, and use the developed techniques to solve real-life problems.
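To make the idea of a hypothesis as a comparison of two samples concrete, here is a minimal Python sketch. It is an illustration only, not the project's algorithm: the two-proportion z-test is a standard textbook test chosen for simplicity, and the data, the product names "A" and "B", the context values "line1" and "line2", and the helper names two_proportion_z_test and compare are all hypothetical. The sketch first compares the failure rates of two products overall, then repeats the comparison within each context to hint at when the difference arises.

```python
import math


def two_proportion_z_test(fail_a, n_a, fail_b, n_b):
    """Two-sided z-test for a difference between two failure rates.

    Returns (rate_a, rate_b, z, p_value). Assumes n_a > 0 and n_b > 0.
    """
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-fail or all-pass: no evidence of a difference.
        return p_a, p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value


def compare(counts, a, b, context=None):
    """Compare products a and b, optionally restricted to one context."""
    def totals(product):
        cells = [v for (p, c), v in counts.items()
                 if p == product and (context is None or c == context)]
        return sum(f for f, _ in cells), sum(n for _, n in cells)

    return two_proportion_z_test(*totals(a), *totals(b))


# Hypothetical data: (product, context) -> (failures, units tested).
COUNTS = {
    ("A", "line1"): (10, 200),
    ("A", "line2"): (50, 200),
    ("B", "line1"): (12, 200),
    ("B", "line2"): (14, 200),
}

# Hypothesis: product A fails more often than product B.
rate_a, rate_b, z, p = compare(COUNTS, "A", "B")
print(f"overall: A={rate_a:.3f} B={rate_b:.3f} z={z:.2f} p={p:.2g}")

# Analysis: repeat the comparison within each context.
for ctx in sorted({c for _, c in COUNTS}):
    rate_a, rate_b, z, p = compare(COUNTS, "A", "B", context=ctx)
    print(f"{ctx}: A={rate_a:.3f} B={rate_b:.3f} z={z:.2f} p={p:.2g}")
```

In this made-up data set the overall difference is driven by one context only, which is the kind of "when or why" information the analysis step is meant to surface. A real system would also need to correct for the many hypotheses tested, which this sketch does not attempt.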

Objectives

The main objective of the proposed work is to build a diagnostic data mining system that can:

Help users understand their data and gain new insights. The proposed system enables users to use computational methods to examine large amounts of data automatically. It summarizes data in a comparative way, which makes it easier for users to identify interesting information and the context in which the information is interesting.
Find actionable knowledge. The proposed system automatically identifies interesting phenomena in the data as well as possible reasons behind them, which either suggests actions that users can take or provides clues that users can follow for further investigation. For example, given a product engineering dataset, the proposed system can answer the following questions: Which product has a higher failure rate than other products? Under which situations is the product more likely to fail?
Make it into real-life applications. Most existing diagnostic data mining systems are not easy to use, and few of them have made it into real-life applications. The proposed system aims to establish a new diagnostic data mining paradigm that is easy to use and represents knowledge in a more intuitive and informative way. This new data mining paradigm has the potential to make diagnostic data mining as successful as predictive data mining in real-life applications.

Thus the scope of this project includes:

Core algorithms. The core algorithms include algorithms for hypothesis generation and analysis, incremental update of generated hypotheses and OLAP operations for exploring hypotheses.
Graphical user interface (GUI). We will design and implement a GUI for the proposed system, which will support basic functions for summarizing and visualizing data, for visualizing discovered knowledge, and for visualizing OLAP operations.
Solving real-life problems. We plan to apply the developed techniques to solve real-life problems. We are particularly interested in problems in the biomedical field, where we have good domain expertise, but we are also keen to collaborate in other fields.

Main Results

Please see this summary poster for the main technical results of this project.

Selected Publications

Guimei Liu, Mengling Feng, Yue Wang, Limsoon Wong, See-Kiong Ng, Tzia Liang Mah, Edmund Jon Deoon Lee. Towards Exploratory Hypothesis Testing and Analysis. Proceedings of 27th IEEE International Conference on Data Engineering (ICDE), pages 745--756, Hannover, Germany, April 2011. PDF
Guimei Liu, Haojun Zhang, Limsoon Wong. Controlling False Positives in Association Rule Mining. Proceedings of the VLDB Endowment, 5(2):145--156, October 2011. PDF, ARminer Software
Guimei Liu, Haojun Zhang, Limsoon Wong. Finding Minimum Representative Pattern Sets. Proceedings of 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 51--59, Beijing, China, August 2012. PDF
Guimei Liu, Andre Suchitra, Haojun Zhang, Mengling Feng, See-Kiong Ng, Limsoon Wong. AssocExplorer: An association rule visualization system for exploratory data analysis. Proceedings of 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 1536--1539, Beijing, China, August 2012. PDF
Guimei Liu, Andre Suchitra, Limsoon Wong. A performance study of three disk-based structures for indexing and querying frequent itemsets. Proceedings of the VLDB Endowment, 6(7):505--516, May 2013. PDF
Guimei Liu, Haojun Zhang, Limsoon Wong. A flexible approach to finding representative pattern sets. IEEE Transactions on Knowledge and Data Engineering, 26(7):1562--1574, July 2014.
Guimei Liu, Haojun Zhang, Mengling Feng, Limsoon Wong, See-Kiong Ng. Supporting exploratory hypothesis testing and analysis. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 31, April 2015.
Wei Zhong Toh, Kwok Pui Choi, Limsoon Wong. Redhyte: Towards a self-diagnosing, self-correcting, and helpful analytic platform. Proceedings of 8th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Part II, pages 3--12, Da Nang, Vietnam, March 2016. PDF, Demo, Github, Source codes (as of 17/11/2015).
Wei Zhong Toh, Kwok Pui Choi, Limsoon Wong. Redhyte: A self-diagnosing, self-correcting, and helpful hypothesis analysis platform. Journal of Information and Telecommunication, 1(3):241--258, July 2017. PDF
Qian Liu, Jinyan Li, Limsoon Wong, Kotagiri Ramamohanarao. Efficient mining of pan-correlation patterns from time course data. Proceedings of 12th International Conference on Advanced Data Mining and Applications (ADMA), pages 234--249, Gold Coast, Australia, 12-15 December 2016.
Limsoon Wong. Big data and a bewildered lay analyst. Statistics & Probability Letters, 136:73--77, May 2018.
Qian Liu, Shameek Ghosh, Jinyan Li, Limsoon Wong, Kotagiri Ramamohanarao. Discovering pan-correlation patterns from time course data sets by efficient mining algorithms. Computing, 100(4):421--437, April 2018.

Dissertations

QIAN Jiangwen. A hypothesis visualization and query system. Final Year Project Report, School of Computing, National University of Singapore, 2011.
YAP Ying Hui Priscilla. Supporting big data analytics in computational biology and public health. Final Year Project Report, Faculty of Science, National University of Singapore, 2013.
Toh Wei Zhong. Redhyte: An interactive platform for rapid exploration of data and hypothesis testing. Final Year Project Report, Faculty of Science, National University of Singapore, 2015.
Emile Bres. eAnalysis: Easier statistical analysis. MComp thesis, School of Computing, National University of Singapore, 2016.

Selected Presentations

Guimei Liu, Andre Suchitra, Haojun Zhang, Mengling Feng, See-Kiong Ng, Limsoon Wong. AssocExplorer: An association rule visualization system for exploratory data analysis. Demo at 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Beijing, China, August 2012. PDF
Limsoon Wong. Exploratory hypothesis testing and analysis. Invited talk at 3rd IPM-NUS Workshop on Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran, 26 February 2013.
Limsoon Wong. Large-scale bio and medical data mining. Invited master class at UTS AAI Big Data Summer School, Sydney, Australia, 10 April 2013.
Limsoon Wong. Mining testable hypothesis from big data. Invited panel at SNU Bioinformatics Workshop and Biofestival 2013, Seoul National University, Seoul, Korea, 24 May 2013.
Limsoon Wong. More data is not better. Talk at 3rd South Asia Workshop on Research Frontiers in Computing, National University of Singapore, 28 May 2013.
Limsoon Wong. Some issues that are often overlooked in big data analytics. Invited talk at ACM International Conference on Information and Knowledge Management (CIKM), Shanghai, China, 4 November 2014. PPT
Limsoon Wong. Exciting promises and potential pitfalls of big data in biology and medicine. Invited keynote at CAS Shenzhen Institute of Advanced Technology Research Centre for e-Health Opening Ceremony-cum-Workshop, Shenzhen, China, 29 November 2014. PPT
Limsoon Wong. Some often-overlooked issues in analytics. Invited talk at Iran University of Science and Technology, Tehran, Iran, 8 March 2015. PPT
Limsoon Wong. Some issues that are often overlooked in big-data analytics. Keynote talk at 7th International Conference on Knowledge and Systems Engineering (KSE2015), Ho Chi Minh City, Vietnam, 8-10 October 2015.
Limsoon Wong. Some issues that are often overlooked in big data analytics. Invited talk at University of Malaya Symposium on Data Science, Kuala Lumpur, Malaysia, 25 October 2016. PPT
Limsoon Wong. A logician-engineer's adventures in data science and analytics. Invited keynote at International Conference on Intelligent Computing, Instrumentation & Control Technologies, Vimal Jyothi Engineering College, Kannur, Kerala, India, 6 - 7 July 2017. PPT
Limsoon Wong. Anna Karenina and the careless null hypothesis in omics data analysis. Invited talk at IPM-Shanghai University Workshop on Systems Biology, Institute for Research in Fundamental Sciences, Tehran, Iran, 2 - 3 August 2017.
Limsoon Wong. Some simple tactics for deriving a deeper analysis of data. Keynote at 4th International Conference on Computational Science and Technology, Kuala Lumpur, Malaysia, 29 - 30 November 2017. PPT

Acknowledgements

This project is supported in part by an A*STAR PSF grant (SERC 102 101 0030, 1/8/2010 - 31/7/2013) and an MOE T2 grant (MOE2012-T2-1-061, 12/10/2012 - 11/10/2015).

Last updated: 25/6/2018, Limsoon Wong.