[ IVLE ]
[ Overview ]
[ Syllabus ]
[ Grading ]
[ Homework
HW 1
>HW 2
HW 3
HW 4
HW 5 ]
[ Misc. ]
In Homework 2, you will implement indexing and searching techniques for Boolean retrieval described in Lectures 2 and 3.
Your indexing script, index.py
, should be called in this format:
$ python index.py -i directory-of-documents -d dictionary-file -p
postings-file
Documents to be indexed are stored in directory-of-documents. In
this homework, we are going to use the Reuters training data set
provided by NLTK. Depending on where you specified the directory for
NLTK data when you first installed the NLTK data (recall that the
installation is triggered by nltk.download()), this data set is
located under a path like:
.../nltk_data/corpora/reuters/training/
.
Recall that the dictionary is commonly kept in memory, with
pointers to each postings list, which is stored on disk. This is
because the size of dictionary is relatively small and consistent,
while the postings can get very large when we index millions of
documents. At the end of the indexing phase, you are to write the
dictionary into dictionary-file
and the postings into
postings-file
. For example, the following command writes
the dictionary and postings into dictionary.txt
and
postings.txt
.
$ python index.py -i
/home/linzihen/nltk_data/corpora/reuters/training/ -d dictionary.txt -p
postings.txt
Although you can use any file names as you like, in this homework
please follow the above command to use dictionary.txt
and
postings.txt
, so that our marking script can easily
locate the files.
In order to collect the vocabulary, you need to apply tokenization
and stemming on the document text. You should use the NLTK tokenizers
(nltk.sent_tokenize()
and
nltk.word_tokenize()
) to tokenize sentences and words,
and the NLTK Porter stemmer to do stemming. You need to do
case-folding to reduce all words to lower case.
You also need to implement skip pointers in the postings lists. Implement the method described in the lecture, where sqrt(len(posting)) skip pointers are evenly placed on the a postings list. Although implementing skip pointers takes up extra disk space, it provides an shortcut to efficiently merge two postings lists, thus boosting the searching speed.
Here is the command to run your searching script,
search.py
:
$ python search.py -d dictionary-file -p postings-file -q
file-of-queries -o output-file-of-results
dictionary-file
and postings-file
are the
output files from the indexing phase. Queries to be tested are stored
in file-of-queries
, in which one query occupies one
line. Your answer to a query should contain a list of document IDs
that match the query in increasing order. In the Reuters corpus, the
document IDs should follow the filenames (that is, your indexer should
assign its document ID 1 to the filename named "1"; also note that while
Reuters doc IDs are unique integers, they are not necessary
sequential). For example, if three documents 12, 40 and 55 are found in
the search, you should write "12 40 55" into output-file-of-results in
one line. When no document is found, you should write an empty
line. The results in output-file-of-results
should
correspond to the queries in file-of-queries.
Your program should not read the whole postings-file into
memory, because in practice, this file may be too large to fit into
memory when we index millions of documents. Instead, you should use
the pointers in the dictionary to load the postings lists from the
postings-file. Make sure you use the seek
and
read
I/O functions that come from python's IO
library for this.
The operators in the search queries include: AND
,
OR
, NOT
, (
, and )
.
You can safely assume that there is no nested parentheses, for
example, the query (a AND (b OR c))
will not occur. This
assumption is natural to how humans design their boolean search
queries. However, there is no restriction on the length of the
query. Note that parentheses have higher precedence than
AND
, OR
and NOT
, while
AND
, OR
and NOT
have the same
level of precedence. AND
and OR
are binary
operators, while NOT
is a unary operator. Below is an
illustration of a valid example query:
bill OR Gates AND (vista OR XP) AND NOT mac
While indexing is an off-line phase, searching is designed to be real-time (the extreme example is Google Instant), thus efficiency is very important in searching. In this homework, we won't be evaluating based on how fast your program can index a list of documents (the preprocessing), but we will test the efficiency of your searching program (runtime speed), as well as its accuracy.
index.py
,
search.py
, dictionary.txt
, and
postings.txt
. Please do not include the Reuters data. You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question. You may receive a small amount of extra credit if you can support your answers with experimental results.
sent_tokenize()
and word_tokenize()
? Can
you propose rules to further refine these results?You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: 2012 Feb 27, 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.
The grading criteria for the assignment is tentatively:
Disclaimer: percentage weights may vary without prior notice.
seek()
, rewind()
,
tell()
and read()
. Another Python module to look at is linecache
. Please look
through the documentation or web pages for these.
Ziheng Lin <linzihen@comp.nus.edu.sg> Wed Feb 2 12:58:35 2011 | Version: 1.0 | Last modified: Sat Mar 24 12:31:30 2012