GS5002: Academic Professional Skills and Techniques, "Journal Club on Big Data and a Bewildered Lay Analyst", 2021 (AY21/22 Sem 1)

Session #2, 24 August 2021: Ginsberg et al. "Detecting influenza epidemics using search engine query data"

A decade ago, Ginsberg et al. made the remarkable claim that influenza epidemics could be predicted well ahead of traditional health surveillance systems by mining trends in the search terms submitted to the Google search engine. The intuition that what people search for correlates with influenza epidemics, consumer behavior, etc. has an enduring appeal, and could have tremendous impact. But does it really work? Why? What have we learned after a whole decade?

Organization of the session

Part I, Background information:

This part deals with the background knowledge of influenza surveillance and statistics that is needed to understand the paper. Some keywords are listed below to help you look up background literature, Wikipedia, etc.

Influenza & US CDC flu surveillance;
Linear regression, auto-regression;
Google Trends.

Part II, The paper by Ginsberg et al.

This part presents the paper itself. We want to know the key technical details and the key messages.

Details of how Ginsberg et al. predict influenza epidemics;
Details of how Ginsberg et al. assess how good their predictions are;
What the main findings of Ginsberg et al. are;
What was observed a few years after this paper? Were the predictions still good?

Part III, Possible points for discussion

This part discusses the Ginsberg et al. paper, hopefully in depth. We want to know whether there are any methodological issues, any doubts about the conclusions/key messages, and any suggestions for improving the paper. Some pointers for discussion include:

Any methodological issue? E.g. have they overlooked anything?
Any issue on the key messages? E.g. do their results support their conclusions?
Are their predictions based on Google search trends better than simple (auto-)regression on CDC flu caseloads?
What other things do people use Google search trends to predict? How effective are these predictions? Are they better than predictions from more “standard” information (e.g. using rankings on best-seller lists to predict sales of a book)?
Does combining Google search terms and more “standard” information help?
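
As a starting point for the comparison between search-trend predictions and simple (auto-)regression, here is a minimal sketch in Python. It uses synthetic data only, not real CDC or Google data. The query model regresses the logit of the flu visit rate on the logit of the query fraction, which is the single-predictor form described in the paper. Everything else is an assumption made for illustration: the function names, the seasonal data generator, the 2-week reporting lag used by the auto-regressive baseline, and the train/test split.

```python
import math
import random


def logit(p):
    """Log-odds transform for a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(pred, actual):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def make_synthetic(weeks=208, seed=0):
    """Hypothetical weekly data: a seasonal flu visit rate and a noisy query fraction."""
    rng = random.Random(seed)
    ili, query = [], []
    for t in range(weeks):
        season = 0.02 + 0.015 * math.sin(2 * math.pi * t / 52)
        p = min(max(season + rng.gauss(0, 0.002), 0.001), 0.2)
        q = min(max(0.5 * p + rng.gauss(0, 0.001), 0.0005), 0.2)
        ili.append(p)
        query.append(q)
    return ili, query


def compare(ili, query, lag=2, train_weeks=156):
    """Fit both models on the first train_weeks and score them on the remaining weeks."""
    y = [logit(p) for p in ili]
    xq = [logit(q) for q in query]
    train = range(lag, train_weeks)
    test = range(train_weeks, len(ili))
    # Query model: this week's flu rate from this week's query fraction.
    a_q, b_q = fit_line([xq[t] for t in train], [y[t] for t in train])
    # Baseline: this week's flu rate from the rate reported `lag` weeks earlier.
    a_ar, b_ar = fit_line([y[t - lag] for t in train], [y[t] for t in train])
    actual = [y[t] for t in test]
    pred_q = [a_q + b_q * xq[t] for t in test]
    pred_ar = [a_ar + b_ar * y[t - lag] for t in test]
    return rmse(pred_q, actual), rmse(pred_ar, actual)


if __name__ == "__main__":
    ili, query = make_synthetic()
    err_q, err_ar = compare(ili, query)
    print(f"query model RMSE (logit scale): {err_q:.3f}")
    print(f"lagged AR model RMSE (logit scale): {err_ar:.3f}")
```

Which model wins here depends entirely on the noise levels chosen in the generator, so the output says nothing about the real question. The sketch is only meant to show how such a comparison can be set up before it is run on actual CDC and search data.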

Instructions

The journal club has 4 sessions. We will discuss only 1 paper in each session. I will pick the paper for the 1st session, to set the scene. Hopefully, you will suggest the papers for the subsequent 3 sessions (we will choose by a simple vote from among the suitable papers you suggest). Any paper can be suggested, so long as it (i) concerns data analysis (esp. big data) and (ii) contains “controversial” analysis or methodological issues that you think are worth your classmates' attention.

For each paper, the presentation is divided into 3 parts: (i) background of the topic/paper, to help students who lack domain knowledge, (ii) the paper itself, focusing on technical details, and (iii) discussion of the paper. You will be divided into 3 teams, and each team presents one part. The teams will rotate through the 3 roles over the 4 sessions. This also means everyone has to read every paper (plus some related papers/webpages that are helpful for understanding the paper being discussed).

Grading will be based on presentation (50%) and on asking and answering questions during the discussion (50%).

Wong Limsoon
20 Aug 2021