CS 3245 - Information Retrieval ť Homework 4

Last updated: Saturday, March 26, 2011 10:31:08 PM SGT - Added sort order for documents with tied relevance.

In Homework 4, you will implement a searcher that supports scoring using vector space model. To do this, you can choose to further modify either your solution to Homework 2 or 3 or use Ziheng's sample solution to Homework 3 (the solution will be released later).

Commonalities with Homework 3

The indexing and query commands will use an identical input format to Homework 4, so that you need not modify any of your code to deal with command line processing. To recap:

Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file

Searching: $ python search.py -d dictionary-file -p postings-file -q file-of-queries -o output-file-of-results

We will also again use the Reuters training data set provided by NLTK. Depending on where you specified the directory for NLTK data whenyou first installed the NLTK data (recall that the installation is triggered by nltk.download()), this data set is located under a path like: .../nltk_data/corpora/reuters/training/.

As in Homework 2 and 3, your search process must not keep the postings file in memory in answering queries, but instead use a file cursor to randomly access parts of the postings file. However, there is no such restriction for the indexer (you are not asked to implement BSBI or SPIMI). Also, you should continue to employ Porter stemming in your submission, both for document indexing as well as query processing.

Vector Space Model

To implement the VSM, you may choose to implement (you can can do it differently) your dictionary and postings lists in the following format. The only difference between this format and that in Figure 1.4 in the textbook, is that you encode term frequencies in the postings for the purpose of computing tf×idf. The tuple in each posting represents (doc ID, term freq). You don't need to do dictionary/postings compression, although you're welcomed to write in/retain this capability.

term	doc freq (df)	→	postings lists
ambitious	5	→	(1, 5)→ (7,2) → (21, 7) → ...
...	...		...

In the searching step, you will need to rank documents by cosine similarity based on tf×idf. In terms of SMART notation of ddd.qqq, you will need to implement the common lnn.ltc ranking scheme (i.e., log tf and no idf for documents, and log tf and idf for queries. Only the query component needs to be cosine normalized). Compute cosine similarity between the query and each document, with the weights follow the tf×idf calculation, where term freq = 1 + log(tf) and inverse document frequency idf = log(N/df) (for queries). That is,

tf-idf = (1 + log(tf)) * log(N/df).

It's suggested that you use log base 10 for your logarithm calculations (i.e., math.log(x, 10), be careful of cases like math.log(0, 10)). The queries we provide are now free text queries, i.e., you don't need to use query operators like AND, OR, NOT and parentheses, and no phrasal queries^*. These free text queries are similar to those you type in the Google search bar.

Your searcher should output a list of up to 10 most relevant (less if there are fewer than ten documents that have matching stems to the query) docIDs in response to the query. These documents need to be ordered by relevance, with the first document being most relevant. For those with same relevance, further sort them by the increasing order of the docIDs.

What to turn in?

You are required to submit index.py, search.py, dictionary.txt, and postings.txt. Please do not include the Reuters data in your submission.

Essay questions

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question.

In this assignment, we didn't ask you to support phrasal queries, which is a feature that is typically supported in web search engines. Describe how you would support phrasal search in conjunction with the VSM model. A sketch of the algorithm is sufficient. (For those of you who like a challenge, please go ahead and implement this feature in your submission but clearly demarcate it in your code and allow this feature to be turned on or off using the command line switch "-x" (where "-x" means to turn on the extended processing of phrasal queries). We will give a small bonus to submissions that achieve this functionality correctly).
Describe how your search engine reacts to long documents and long queries as compared to short documents and queries. Is the normalization you use sufficient to address the problems (see Section 6.4.4 for a hint)? In your judgement, is the lnc.ltc scheme sufficient for retrieving documents from the Reuters-21578 collection?
Do you think zone or field parametric indices would be useful for practical search in the Reuters collection? Note: the Reuters collection does have metadata for each article but the quality of the metadata is not uniform, nor are the metadata classifications uniformly applied (some documents have it, some don't).

Submission Formatting

The instructions below are repeated for clarity sake.

You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

A plain text documentation file README-<matric number>.txt (e.g., README-U000000X.txt): this is a text only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your matric number and email (we need the email to contact you about your grade, please use your u*******@nus.edu.sg address, not your email alias). This is to help you get an objective grade in your assignment, as we won't associate matric numbers with student names. Please indicate in your README whether you modified the sample solution to Homework #2, modified your own Homework #2 or #3 or used Ziheng's model code or wrote your code from scratch.
All source code. We will be reading your code, so please do us a favor and format it nicely.
A plain text file essay-<matric number>.txt that contains your answers to the essay questions.

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Thursday, 31 Mar 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment is tentatively:

40% VSM handling and efficiency
20% Search results correctness
10% Documentation
30% Essay questions

Disclaimer: percentage weights may vary without prior notice.

Hints

Indexing a large number of documents may take a while and a lot of space. Again, we suggest that you just test and develop your system on a subset that will make it quick to do experimentation.
Working in a group can be helpful but can be counter-productive too. I have observed that group work tends to make homework submissions grades slightly higher than single submissions but also more average (it's harder to have an outstanding grade with a group). If you think you work better in a group of two, please do. Make sure to clearly partition the work and demarcate the API between your modules.
While you don't need to print scores out for the official runs that your system will be graded on, you may find it useful to include such information in your debugging output when developing your solution.
A consequence of only needing the top 10 documents is that you can speed up search by using a max-heap data structure to process the document list. You could set the max-heap to contain only 100 documents (fewer, if you'd like) and at the end of processing, just report the max 10. Note that the max-heap approach speeds ranking relatively more in document collections that are significantly larger than Reuters-21578 (our collection is a bit too small to have this give much efficiency gain. To think about: why?)

Min-Yen Kan <kanmy@comp.nus.edu.sg> Thu Mar 17 08:45:44 SGT 2011 | Version: 1.0 | Last modified: Thu Mar 31 23:09:31 2011

CS 3245 ť Homework 4 - The Vector Space Model