[ IVLE ]
[ Overview ]
[ Syllabus ]
[ Grading ]
[ Homework
HW 1
HW 2
>HW 3
HW 4
HW 5 ]
[ Misc. ]
In Homework 3, you will implement a searcher that supports positional indexing and postings compression. To do this, you can choose to further modify either your solution to Homework 2 or use Ziheng's sample solution to Homework 2.
The indexing and query commands will use an identical input format to Homework 2, so that you need not modify any of your code to deal with command line processing. To recap:
Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p
postings-file
Searching: $ python search.py -d dictionary-file -p postings-file -q
file-of-queries -o output-file-of-results
We will also again use the Reuters training data set provided by
NLTK. Depending on where you specified the directory for NLTK data
when you first installed the NLTK data (recall that the installation
is triggered by nltk.download()), this data set is located under a
path like: .../nltk_data/corpora/reuters/training/
.
As in Homework 2, your search process must not keep the postings file in memory in answering queries, but instead use a file cursor to randomly access parts of the postings file. However, there is no such restriction for the indexer (you are not asked to implement BSBI or SPIMI).
Your searching code now needs to additionally support phrasal queries, such as:
"New York" OR Boston
To do this, you need to change your data structure for postings to include information about where the words occur. You only have to implement correct phrasal query handling (no proximity queries). You will need to modify your indexing scheme to change the data structure used for postings. Slightly more tricky is the modifications you need to make to your retrieval code (We can regard a phrasal query as an AND query but with specific positional constraints).
Your implementation needs to make proper use of skip pointers at least at the document level. Skip pointers are also helpful in storing and accessing positional information, but you do not have to implement skip pointers for the positional information (but you may if you want, no bonus will be given to this, although it may help your search engine's efficiency).
Your code should implement postings compression as described in Chapter 5 of IIR. Implement gap-based compression for postings (both for document postings as well as positional postings). Use variable byte encoding (VBE) as described in the textbook (this requires bit manipulation in Python; note you can only read bytes in Python and subsequently manipulate bits using bit operations). Make sure to implement this functionality over several steps so that you may answer the essay question properly, see below.
index.py
,
search.py
, dictionary.txt
, and
postings.txt
. Please do not include the Reuters
data in your submission.You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question.
he
and
"the who"
are disallowed but "bmw 325"
and
"national unversity of singapore"
are allowed. Could
we make any cost savings in indexing at either indexing or query
time? ("Cost savings" here covers both space and/or speed).
The instructions below are repeated for clarity sake. Only parts highlighted in red are different from previous assignments.
You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Thursday, 17 Mar 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.
The grading criteria for the assignment is tentatively:
Disclaimer: percentage weights may vary without prior notice.
Min-Yen Kan <kanmy@comp.nus.edu.sg> Mon Feb 21 11:55:32 2011 | Version: 1.0 | Last modified: Thu Mar 31 23:09:40 2011