CS 5246 - Text Processing and the Web > Homework

Homework #2 - Factual List Question Answering

In this assignment, you will develop a question answering system for list questions. A list question differs from the traditional factoid question in that there are multiple correct answers. As such, a question answering system that answers list questions is assessed on the completeness of the list returned to the user. As in factoid question answering, each answer returned should be an exact answer -- additional verbiage beyond the answer itself will be penalized.

Your system will be fed correctly spelled, well-specified questions in English. Each question will be given on a single line of input, provided to your program on standard input.

Your program should return a list of relevant answers on standard output. No difference in score will be assessed to answers on different lines; all answers are judged equally important.

For example, given a list question such as "List the public universities in Singapore", the (current) answers should be "National University of Singapore" and "Nanyang Technological University" (SMU is a private university, although some of its funding comes from public coffers). Each result should be on a separate line, written in UTF-8, and separated by a carriage return. If there are no valid answers to the question, a single line response with the word "nil" should be returned. No question will have more than fifty correct answers, and all questions that do not have a specific timeframe indicated will refer to the present answer (vs. historical).
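The input and output conventions above can be sketched as follows. This is a minimal illustration, not a reference implementation: `find_answers` is a hypothetical placeholder for your own pipeline, and the sketch assumes that "carriage return" means an ordinary newline between answers.

```python
import sys


def format_answers(answers):
    """Return the output text: one answer per line, or "nil" if there are none."""
    seen = set()
    unique = []
    for answer in answers:
        answer = answer.strip()
        # Drop empty strings and exact duplicates, keeping the original order.
        if answer and answer not in seen:
            seen.add(answer)
            unique.append(answer)
    return "\n".join(unique or ["nil"]) + "\n"


def find_answers(question):
    """Hypothetical placeholder for the actual question answering pipeline."""
    return []


def main():
    # One question arrives on a single line of standard input.
    question = sys.stdin.readline().strip()
    text = format_answers(find_answers(question))
    # Write bytes explicitly so the output is UTF-8 regardless of locale.
    sys.stdout.buffer.write(text.encode("utf-8"))
```

Calling `main()` at the bottom of your script wires this up to standard input and output.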

To assess your list QA system, we'll again be testing it with some training and test questions. Below are five list questions, provided for training and development, for which your system should be able to return a correct result.

List the lines in the Hong Kong MTR system: Tsuen Wan, Kwun Tong, Island, Tung Chung, Airport Express, Tseung Kwan O, Disneyland Resort.
List the member countries of the Association of Southeast Asian Nations (ASEAN): Brunei Darussalam, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, Vietnam
List the classic ice cream flavors that are produced by Häagen-Dazs: Baileys® Irish cream, Banana split, Black walnut, Butter pecan, Caramel cone, Cherry vanilla, Chocolate, Chocolate chip cookie dough, Chocolate chocolate chip, Chocolate peanut butter, Cinnamon dulce de leche, Coffee, Cookies & cream, Crème Brulée, Dulce de leche, Mango, Mint chip, Mocha chip, Pineapple coconut, Pistachio, Rocky road, Rum raisin, Sticky toffee pudding, Strawberry, Vanilla, Vanilla bean, Vanilla chocolate chip, Vanilla swiss almond, White chocolate raspberry truffle
List the world's rivers that are over 8000 kilometers (km) long: nil
List the Exchange Traded Funds (ETFs) that are listed on the Singapore Exchange (SGX) that are not U.S. cross-listed: ABF Singapore Bond Index Fund, CIMB FTSE ASEAN40 ETF, Daiwa FTSE Shariah Japan 100, iShares MSCI India ETF, Lyxor ETF China Enterprise (HSCEI), Lyxor ETF Commodities CRB, Lyxor ETF Hong Kong (HSI), Lyxor ETF India (S&P CNX Nifty), Lyxor ETF Japan (TOPIX©), Lyxor ETF MSCI AC Asia-Pacific Ex Japan, Lyxor ETF Korea, Lyxor ETF Taiwan, SPDR® GOLD SHARES, streetTRACKS® Strait Times Index Fund

Aside from these five questions, there will be two additional sources of questions. Each homework submission (individual or group) will need to come up with a single list question and its correct answer by Week 9 (15 Oct) of the assignment and post it to the forum. This is counted as a deliverable for your assignment, and will be graded. Together with the above five questions, the answers to these questions will form the known set of list questions that your system will be graded on. (Update 4 Nov: An updated zip file of the queries is now available: hw2-needs-v2.0.zip)

An additional five hidden (test) questions will be revealed after the assignment is due. Your system will also be assessed on these test questions. The hidden questions will be slightly higher in assessment weight than the training questions. Minor typographical differences (capitalization, diacritics) as well as misspellings and variations on names will be counted as correct. All answers should be in English where possible (e.g., Question 1 above also has corresponding Chinese answers).

You can again work in teams of two or individually for this assignment. There will be no adjustment to scores based on whether the assignment is done in a team or individually. However, if you worked on Homework #1 as a group, you will not be allowed to do this assignment in a group; you may only do it individually. The group option is only open to those who did Homework #1 as individuals.

Note that since this is an assignment that comprises at least 25% of your grade, I expect the level of effort for this assignment to be similar. You have five weeks to do this assignment. The list questions will all be numbered and be made available as a zip archive, following the submission and verification of all list questions by Week 10.

Restrictions: You are allowed to use any resources on the web that are not themselves list or factoid question answering systems. For example, submitting the questions to Ask.com's question answering service is not allowed. Retrieving and analyzing this specific homework page (which has the answers to the five training questions) is also not allowed -- you may have to write code specifically to discard URLs / web resources that come from http://www.comp.nus.edu.sg.
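One way to discard results from the restricted host is to compare the hostname of each candidate URL before downloading it. The sketch below is only an illustration: the function names are made up, and it assumes that search results arrive as absolute URLs with a scheme.

```python
from urllib.parse import urlsplit

# Host named in the restriction above.
BLOCKED_HOST = "www.comp.nus.edu.sg"


def is_blocked(url, blocked_host=BLOCKED_HOST):
    """Return True if url points at the blocked host (or a subdomain of it)."""
    # urlsplit lowercases the hostname; it is None for URLs without a scheme.
    host = urlsplit(url).hostname or ""
    return host == blocked_host or host.endswith("." + blocked_host)


def allowed_urls(urls):
    """Keep only the URLs that may be retrieved and analyzed."""
    return [url for url in urls if not is_blocked(url)]
```

Comparing parsed hostnames rather than searching for a substring avoids discarding unrelated pages that merely mention the host in their path or query string.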

What to turn in

You will upload an HT0000000.zip archive (where HT0000000 is your matric ID, with all letters in uppercase) by the due date, consisting of the following four sets of items. Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your submission. Do not include a subdirectory for the submission to extract to (e.g., unzipping X.zip should give files like X.sum, not X/X.sum or submission/X.sum). Please use all capital letters when writing your matric number (matric numbers should start with U, NT, HT or HD for all students in this class). Your cooperation with the submission format will allow me to grade the assignment in a timely manner. Note that I do not want to know who you are when grading assignments, so it is important that you try not to reveal your identity in your submission. Please follow the instructions below to the letter.
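Before uploading, you may want to check that your archive does not unpack into a wrapper directory. The following is a rough sketch of such a check using Python's standard `zipfile` module; the function names are illustrative and not part of the submission requirements.

```python
import zipfile


def wrapper_directory(names):
    """Return the single top-level directory that holds every entry, or None."""
    tops = set()
    for name in names:
        head, sep, _ = name.partition("/")
        if not sep:
            # At least one entry sits at the archive root, so no wrapper.
            return None
        tops.add(head)
    return tops.pop() if len(tops) == 1 else None


def check_archive(zip_path):
    """Return the wrapper directory of the archive at zip_path, or None."""
    with zipfile.ZipFile(zip_path) as archive:
        return wrapper_directory(archive.namelist())
```

A result of `None` means the files extract directly into the current directory; any other value names the wrapper directory that should be removed.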

A summary file in plain text (not MS Word, not OpenOffice) that describes your submission and the architecture for retrieval. You should include your matric number and your NUS (u|g) prefixed email address as the only form of ID. In this file you also need to describe how your source code can be built and executed on sf3/sunfire. If your submission cannot be run on sunfire, you'll need to demonstrate it to me soon after the submission date (by downloading your submission file and running it on your system). You should include notes about the development of your submission, and any special features that you developed to handle the structure of the queries and documents (filename: ReadmeHT0000000.txt, where HT0000000 is your matric ID). Warning! If you use any lexicons, resources, code or algorithmic description that are beyond the references on this page, you need to give proper credit and acknowledge the contribution of others. Please cite or acknowledge work that helped you that you did not do on your own. I will deduct credit accordingly, if applicable. Failure to acknowledge your sources constitutes plagiarism and will be punished accordingly.
The files for the retrieval results for all public queries. These should be in a similar form to the gold-standard files: the list question ID on the first line and the answers on the following lines. Each answer line should have the exact answer, followed by a tab (\t) character, followed by the URL from which the answer was extracted. These files should be named nX.txt, where X should be replaced by the list question ID. A sample file is here.
Your source code tree. This should be relatively well documented so that I can follow the logic of your code, with the help of the ReadmeHT0000000.txt file. Typing in "make" or "ant" should build the appropriate code, such as an executable, if needed. In your assignment submission, please do not assume that any environment variables (e.g., PATH and CLASSPATH) are necessarily correctly set. The executable file to run your system should be named runHT0000000 (where HT0000000 is to be replaced by your matric number, as above) and be set as executable (by you or by your buildfile if it is compiled).
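The results file layout described above (question ID on the first line, then one answer, tab, and source URL per line) could be produced along these lines. This is a sketch under assumptions: the first line is taken to be the bare question ID, and the function names are hypothetical. The sample file remains the authoritative format.

```python
import os


def format_results(question_id, answers):
    """Build the file text from (answer, source_url) pairs."""
    lines = [str(question_id)]
    for answer, url in answers:
        # Exact answer, a tab character, then the URL it was extracted from.
        lines.append(f"{answer}\t{url}")
    return "\n".join(lines) + "\n"


def write_results(question_id, answers, directory="."):
    """Write nX.txt (X = question ID) in UTF-8 and return its path."""
    path = os.path.join(directory, f"n{question_id}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_results(question_id, answers))
    return path
```

For example, `write_results(1, [("Island", "http://example.org/mtr")])` would create `n1.txt` in the current directory; the URL here is a made-up placeholder.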

Grading scheme

Your grade will take into account 1) features used, 2) retrieval accuracy, 3) peer annotation, 4) documentation and 5) time efficiency. These factors are listed in order of importance/weighting to your final grade for the assignment. Warning -- I will be reading your code, so please make sure it is tidy and well documented.

[41 percent] Features used. This will be judged on the basis of your code and your summary file. It covers which features you use, whether you take advantage of the semi-structure in the input, and how you modified the ranking score to get the final results.
[37 percent] Retrieval accuracy. This will be judged based on the pooled list judgments that all students turn in (the nX-answers.txt files in your submission). I will also include some additional test queries that you will not know ahead of time. I'll be using the average instance precision and instance recall metrics, as used in the TREC 2004 list question evaluation, except that a "nil" answer will be scored with an IP/IR score of 1 if and only if the system returns "nil" as its only answer. Note that minor changes to the answer (case differences, missing diacritics, etc.) that do not change the semantics of the answer will be counted as correct.
[7 percent] List Question and Answer. To judge #2 (retrieval accuracy) I will be looking at your submitted question and its answer, for completeness, lack of ambiguity and possible judgment. As discussed in class, questions such as "Where is the Taj Mahal?" would be considered a poor question (since several answers are possible). Your list question should include any expansions of acronyms and should have at most one scoping clause (e.g., in question 5, "that are not U.S. cross-listed"). Note: You may decide to ask a question that generates a nil answer -- you are not obligated to have a question that has answers.
[13 percent] Documentation. How well the summary file and source code are documented. This will include how easy it is for me to run your software and the state of your code (is it readable, and is the workflow well partitioned?).
[2 percent] Time efficiency of the system. As long as the system takes no longer than 5 minutes to produce the results for a question, it will be considered satisfactory. Again, the purpose of this is to ensure that your system can be run and graded within three weeks.
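To estimate your retrieval accuracy during development, instance precision (IP) and instance recall (IR) for a single question can be computed roughly as below. This sketch assumes the usual TREC definitions (IP is distinct correct answers over all answers returned, IR is distinct correct answers over all known answers) and applies the "nil" rule stated above. The `normalize` helper is an assumption that handles only case and diacritics; misspellings and name variants are left to the human judge.

```python
import unicodedata


def normalize(answer):
    """Fold case and strip diacritics so minor typographic variants match."""
    decomposed = unicodedata.normalize("NFKD", answer.strip())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def instance_scores(returned, gold):
    """Return (IP, IR) for one question.

    returned: answer strings produced by the system.
    gold: known correct answers; an empty list means the answer is "nil".
    """
    returned_norm = [normalize(a) for a in returned]
    gold_norm = {normalize(a) for a in gold}
    if not gold_norm:
        # nil question: full credit only when "nil" is the sole answer.
        score = 1.0 if returned_norm == ["nil"] else 0.0
        return score, score
    correct = len(set(returned_norm) & gold_norm)
    precision = correct / len(returned_norm) if returned_norm else 0.0
    recall = correct / len(gold_norm)
    return precision, recall
```

Averaging the two values over all questions gives a rough estimate of the figures used in grading, although the official judgments may differ from this string-matching approximation.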

Due date and late policy

According to the syllabus, this homework is due on 5 Nov at 11:59 pm SGT. Submit your zip file to the IVLE workbin by this time. The late policy for submissions applies as per the policy set forth on the "Grading" page.

References

The BOSS homepage. Probably not as useful as the forum or the PDF documentation.
wget - an open-source command-line URL fetching utility. Also already installed on sunfire. Recommended for interacting with BOSS.
A slightly outdated list of QA system components and papers by Nimar S. Arora, of the UC Berkeley TREC group.
You may want to read Ellen Voorhees' paper, Implementing a Question Answering Evaluation which touches on question formulation, before deciding on your training question for Week 9.
Hui Yang, a former MS student here, worked quite extensively on list questions. You may want to read her techniques in finding list answers from her publications.
You'll find that a number of sites on the web contain lots of factual information that can be mined (see the note below about research systems too). Some of these sites may be useful to you. If you find others, please list them in the forum.
- General encyclopedia, events - Wikipedia
- Geographic facts - CIA Factbook
- Famous People - Biography.com
- FAQs - Yahoo! Answers, Google Knol, phpBB sites
- Song lyrics, Famous quotations, etc.

Hints

You can partially leverage the knowledge and the system that you built previously in Homework #1. In particular, you may harness Yahoo! BOSS again, as was done in Homework #1.
You can use external sources in RPNLPIR (such as lexica like WordNet or statistics like IDF statistics over the WebBase corpus) to assist your programs. If you do plan to use external resources, please be aware that they take time to compile and preprocess into a useable form for you to take advantage of.
You may note that many research systems (including ones created here by Prof. Chua Tat-Seng's group) mine and use resources on the web. You may want to look into integrating these with your homework submission. This is at your discretion, however: for a short homework assignment such as this one, you may find it better to concentrate on algorithm design rather than compiling resources. To create a simple version of mining resources, consider using the "site:" query restrictor in search engine queries.
You may find it helpful to download the documents yourself and process them. If you do download documents, please make sure that your program doesn't hang when faced with a recalcitrant page download, given the five minute deadline for each query.
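One way to guard against a hanging download is to combine a per-page timeout with an overall time budget for the question. The sketch below uses only the Python standard library; the default values (10 seconds per page, 240 seconds overall) and the function names are arbitrary assumptions chosen to leave headroom under the five minute limit.

```python
import http.client
import time
import urllib.request


def fetch(url, timeout=10.0):
    """Download url and return its body as bytes, or None on any failure."""
    try:
        # Note: the timeout applies to each blocking socket operation,
        # not to the download as a whole.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # Covers URL errors, HTTP errors, timeouts, and malformed URLs.
        return None


def fetch_all(urls, budget_seconds=240.0, per_page_timeout=10.0, fetcher=fetch):
    """Fetch pages in order until the overall time budget runs out."""
    deadline = time.monotonic() + budget_seconds
    pages = {}
    for url in urls:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # Out of time: answer with what has been collected.
        body = fetcher(url, min(per_page_timeout, remaining))
        if body is not None:
            pages[url] = body
    return pages
```

Because the socket timeout does not bound the total transfer time, a server that sends data very slowly can still overrun the per-page limit; a stricter design would run downloads in a separate worker and abandon them at the deadline.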

Min-Yen Kan <kanmy@comp.nus.edu.sg> Created on: Mon Sep 29 22:58:43 SGT 2008 | Version: 1.0 | Last modified: Tue Nov 4 09:26:31 2008
