Updated: Wed Jan 31 17:03:05 GMT-8 2007: Added new renamed corpus link.
Homework #1 - Search Engine for Scientific Papers
In this assignment, you will develop a search engine for
scientific papers. As in many other domains, the input is
largely free text, but some semi-structure can be recovered from it
if effort is invested. Also, as with other real-world problems, the
input is noisy: the input files have been automatically converted from
PDF to plain text using software that does not do a perfect job
of conversion.
Your assignment is to create a basic search engine that will
retrieve relevant papers from a corpus of about 6,100 such articles in
plain-text format. To do this, you should have access to a computer
with enough storage space for the corpus (about 420 MB
uncompressed). Alternatively, you can use the corpus directly on
sf3/sunfire: after logging in, you can view the documents under
~rpnlpir/public_html/workspace/kanmy/5246/5246corpus.
The individual files are also visible on the web, but be warned:
the directory listing alone is already 400 KB in size.
Note that each file in the corpus has its origin URL
encoded in its filename.
For convenience we have pre-indexed the whole corpus using a
standard IR system, Lucene. You can programmatically call the Lucene
IR engine to return relevant filenames (given a query constructed by
your program). You can then post-process these results to re-rank the
results as needed.
Alternatively, you can choose to work from scratch and develop your own IR
system based on the concepts of weighting and retrieval that we have
discussed in class. This will be more difficult, and as such, will be
graded more leniently than if you use the simpler (but recommended)
method of post-processing the results from Lucene.
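To make the weighting idea concrete, here is a minimal, illustrative sketch of the standard tf-idf scheme (sublinear term frequency with a log-scaled inverse document frequency). The class and method names are hypothetical and nothing here is required code; it is only a sketch of the general idea, assuming you have already tokenized documents into term-frequency maps.

    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {
        // idf(t) = log(N / (1 + df(t))), where N is the number of documents in the corpus.
        static double idf(int numDocs, int docFreq) {
            return Math.log((double) numDocs / (1 + docFreq));
        }

        // score(d, q) = sum over query terms t of (1 + log tf(t,d)) * idf(t)
        static double score(Map<String, Integer> docTermFreqs, List<String> queryTerms,
                            Map<String, Integer> docFreqs, int numDocs) {
            double s = 0.0;
            for (String t : queryTerms) {
                Integer tf = docTermFreqs.get(t);
                Integer df = docFreqs.get(t);
                if (tf != null && tf > 0 && df != null && df > 0) {
                    s += (1.0 + Math.log(tf)) * idf(numDocs, df);
                }
            }
            return s;
        }
    }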
Either way, you will have to come up with a solution that 1)
creates suitable queries given statements of information need, and 2)
deals with the semi-structured and noisy nature of the input.
You can now search and retrieve files from the
Lucene index. I've finished creating a web front end to
the system. You can use it interactively,
or programmatically from the web, e.g., http://www-appn.comp.nus.edu.sg/~rpnlpir/workspace/kanmy/5246/index.cgi?mode=process&q=information+retrieval.
The output format is: first line, the query description; second
line, the number of hits; third and subsequent lines, the raw score from
Lucene and the path to the document.
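If you call the interface programmatically, that three-part output is easy to parse. The sketch below fetches a query result and collects (score, path) pairs for later re-ranking; the URL and line layout are as described above, but the class name is mine and I am assuming the score and path on each result line are separated by whitespace (check the actual output to confirm).

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;

    public class FetchResults {
        public static void main(String[] args) throws Exception {
            String base = "http://www-appn.comp.nus.edu.sg/~rpnlpir/workspace/kanmy/5246/index.cgi?mode=process&q=";
            String query = "information retrieval";                  // example query
            URL url = new URL(base + URLEncoder.encode(query, "UTF-8"));
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String description = in.readLine();                      // line 1: query description
            int numHits = Integer.parseInt(in.readLine().trim());    // line 2: number of hits
            List<String[]> hits = new ArrayList<String[]>();         // each entry: {score, path}
            String line;
            while ((line = in.readLine()) != null) {                 // remaining lines: score and path
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length == 2) hits.add(parts);
            }
            in.close();
            System.out.println(description + ": " + numHits + " hits, " + hits.size() + " parsed");
            // hits can now be re-ranked with your own features before writing your results.
        }
    }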
If you want to use your account on sf3 to do the assignment,
please do. You have two choices:
- You can call the Java application that I demonstrated in
class to get retrieval results. On sf3 you can run the
following command:
/usr/local/bin/java -cp /home/rsch/rpnlpir/public_html/workspace/kanmy/5246/lucene-core-2.0.1-dev.jar:/home/rsch/rpnlpir/public_html/workspace/kanmy/5246/lucene-demos-2.0.1-dev.jar org.apache.lucene.demo.SearchFiles -index /home/rsch/rpnlpir/public_html/workspace/kanmy/5246/index -query tfidf
to get the output of Lucene for a specific query (here, "tfidf").
The output format is the same as in the web interface.
- You can reference the Lucene index directly, if you want
to use Lucene's API (a minimal sketch follows this list). The index has
a single indexed, unstored field called "contents". You can see the
sample code used to create the sample application (which is the same as
used by the web interface) in the SearchFiles.java code in the same
directory. Note that this SearchFiles.java is a slightly hacked-up
version of the org.apache.lucene.demo.SearchFiles class shipped with the
Lucene 2.0.0 distribution, modified to allow command-line queries (they
are not the same!).
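For reference, below is a minimal sketch of querying the index through the Lucene 2.0 API. The index path and the "contents" field come from the description above; the stored "path" field is my assumption (it is what the Lucene demo indexer stores), so check SearchFiles.java in the same directory for the actual field names.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class QueryIndex {
        public static void main(String[] args) throws Exception {
            String indexDir = "/home/rsch/rpnlpir/public_html/workspace/kanmy/5246/index";
            IndexSearcher searcher = new IndexSearcher(indexDir);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse("tfidf");                 // example query string
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                // "path" is assumed to be a stored field, as in the Lucene demo indexer;
                // verify the field names against SearchFiles.java.
                System.out.println(hits.score(i) + " " + doc.get("path"));
            }
            searcher.close();
        }
    }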
To assess your submissions, we will be using statements of
information need, which your system needs to automatically convert into
queries to find relevant documents. Your retrieval results will be
assessed against answers compiled by the class. Each student
will be assigned two needs for which to find relevant documents in the
corpus; the answer to each need will be a list of documents
that the student compiles. The sum of the query answers will be used
to grade system performance. Below are the 13 information needs that
will be used to test all systems; another 2 have been withheld for
private testing (to be made public only after the submission deadline).
You can download these needs as a zip archive.
Needs will be provided to your system as a 2-line input text file to
be read from standard input, in which the first line is the title of
the query (the bolded title of each need below) and the second line is the
description.
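As a concrete starting point, here is a small sketch of reading such a 2-line need from standard input. The class name is mine, and the query construction shown (just reusing the title) is deliberately naive; your system should do something smarter with both fields.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class ReadNeed {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String title = in.readLine();          // line 1: the need's title
            String description = in.readLine();    // line 2: the longer description
            // Naive baseline: use the title words, whitespace-normalized, as the query.
            String query = title.trim().replaceAll("\\s+", " ");
            System.out.println("Query: " + query);
        }
    }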
- Languages for Object Modeling:
A relevant document will discuss standardized symbols and grammars for
languages used to describe objects. These include, but are not limited
to, applications in software engineering and system design.
- 3D Modeling:
A relevant document will describe methods to efficiently compute and
construct 3D models from automatically acquired data (e.g.,
rangefinder data). Documents that discuss modeling from manually
entered clean data are not relevant.
- Medical image analysis:
A relevant document will describe theory and/or algorithms for
analyzing medical images. Visualization of medical data is not
relevant to this topic. Performance measures with respect to a test
set should be included if a standard data set is used.
- Tutorials and Surveys:
A relevant paper will be a simple introduction to a technical computer
science topic, explaining the topic at a high-level, easy enough for
non-specialists to understand.
- Dimensionality Reduction:
Papers that discuss any form of dimensionality reduction as their core
topic are relevant. These include Latent Semantic Indexing, Singular Value
Decomposition and any form of feature selection.
- Abstracts, Panels and Demos:
A relevant paper will not include technical details of the work but
provide only a short abstract of the speaker's topic, panel discussion
or system demonstration.
- Upper bound estimation:
Papers that establish a theoretical upper bound on complexity in terms
of either space or time for a particular problem are relevant.
- Decision Tree Use:
Papers that describe some application of the machine learning method
of decision trees are relevant. Papers that only use decision trees
for performance comparison (i.e., as a baseline) are not considered
relevant.
- Large-Scale Digital Library Implementations:
Papers that discuss, use or touch upon any large scale digital library
or search engine system are considered relevant.
- Papers from the UK:
Any papers authored by people who work or are affiliated with
institutions in the United Kingdom are considered relevant.
- Datasets for text classification:
Relevant papers will mention at least one dataset primarily used for
supervised text classification (e.g., Reuters 21578).
- Question Answering:
A relevant paper will describe an approach to question answering.
Papers solely on document retrieval or passage retrieval are not
considered relevant.
- Weighting schemes for retrieval:
Relevant papers will describe either a set of weighting schemes or a
particular weighting scheme developed for text, image, or video
retrieval.
Note that since this assignment comprises at least 25%
of your grade, I expect a commensurate level of effort. You have five
weeks to do this assignment. You should start
immediately by finishing your judgments of which documents are
relevant to which information needs. Hopefully this will give you an
idea of how to code your search engine, which you can then build on to
complete the assignment.
What to turn in
You will upload an X.zip archive (where X is your matric ID, with all
letters in uppercase) by the due date, consisting of the
following four sets of items. Note that, for grading purposes, I do not
want to know who you are, so it is important
that you try not to reveal your identity in your submission. Please
follow the instructions below to the letter.
- A summary file in plain text (not MS Word, not OpenOffice),
giving your matric number and your NUS (u|g)-prefixed email address
(as the only forms of ID), and describing your submission and the
architecture for retrieval. In this file you also need to describe
how your source code can be built and executed on sf3/sunfire
(filename: ReadmeX.txt, where X is your matric ID). You should include
notes about the development of your submission, and any special features
that you developed to handle the structure of the queries and
documents. If you decided to create your own IR library or to use an IR
library other than the Lucene web interface that is provided, you'll
need to provide information on how I should index the collection, or
a URL from which to download the indices that you used in your experiments.
Warning! If you use any resources, code or descriptions of
search engines beyond the references on this page, you need
to give proper credit and acknowledge the contribution of others.
Please cite or acknowledge work that helped you that you did not do on
your own. I will deduct credit accordingly if needed. Failure to
acknowledge your sources constitutes plagiarism and will be punished
accordingly.
- Two gold-standard lists of relevant documents, one for each of the
two needs you were assigned to find relevant documents for. You should
assess relevance only on the basis of the text file you have to read
(there will be lots of noise); please do not consult the original PDF
file that can be reconstructed from the encoded URL in the filename.
Each file should list the information need ID on the first line, and on
each subsequent line a relevance judgement (+ or -) and a filename,
separated by a space; see this example file. You should list at least
fifty documents, and annotate more relevant documents if possible.
These files should be named nX-gold.txt, where X should be replaced by
the need ID.
- Fifteen files for the retrieval results for all 15 training
queries. These should be in a form similar to the gold-standard
files: the need ID on the first line and the filenames of relevant
documents (in relevance order) on the subsequent lines. These files
should be named nX.txt, where X should be replaced by the need ID. A
sample file is here.
- Your source code tree. This should be relatively well
documented so that I can follow the logic of your code, with the help
of the ReadmeX.txt file.
Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your
submission. Do not include a directory in the submission to extract to
(e.g., unzipping X.zip should give files like X.sum, not X/X.sum or
submission/X.sum). Please use all capital letters when writing your
matric number (matric numbers should start with U, NT, HT or HD for
all students in this class). Your cooperation with the submission
format will allow me to grade the assignment in a timely manner.
Grading scheme
Your grade will take into account 1) features used, 2) retrieval
accuracy, 3) peer annotation, 4) documentation and 5) time efficiency.
These factors are listed in order of importance/weighting to your
final grade for the assignment. A bonus/additional component will be
added to your grade in case you decide to create your own IR library.
Warning -- I will be reading your code, so please make sure it is tidy
and well documented.
- Features used. This will be judged on the basis of your code
and your summary file: what features you use, whether you
take advantage of the semi-structure in the input, and how you
modified the ranking score to get the final results.
- Retrieval accuracy. This will be judged based on the pooled
relevance judgments that all students turn in (the nX-gold.txt
files in your submissions). I will also
include some additional test queries that you will not know
ahead of time.
- Peer Annotation. To judge #2 (retrieval accuracy), I will be
looking at your annotated results to check for completeness and
good manual retrieval. You can (and should) use both the
Lucene IR interface (below) as well as other sources (CiteSeer,
Google Scholar, DBLP) to assess whether your results are
complete. Note that since our corpus is only a tiny fraction of all
scholarly documents on the web, there will be lots of relevant
papers not found in our collection; these you do not have to
worry about.
- Documentation. How well the summary file and source code are
documented. This will include how easy it is for me to run
your software and the state of your code (is it readable, and
is the workflow well partitioned?). In your assignment submission,
please do not assume that any environment variables (e.g., PATH and
CLASSPATH) are necessarily set correctly.
- Time efficiency of the system. As long as the system takes no
longer than 30 seconds to produce a result for a need, it will be
considered satisfactory. Students who choose to build their
own IR library will be allowed up to 3 minutes to produce
results for each need.
Due date and late policy
According to the syllabus, this homework is due on 26 Feb at 11:59
pm SGT. Submit your zip file to the IVLE workbin by
this time. The late policy set forth on the "Grading" page applies.
For those of you who are doing the homework but need to demo, you can
sign up for a demo slot, or if none of the slots work for
you, you can arrange to demo separately with me. Note that students
who need to demo still have to submit on time.
References
- Apache Lucene - The most
widely used, open-source IR library. I have indexed the input
collection using this library. You can programmatically use
this library to retrieve results from the collection which you
can then post-process.
- Here is a link to the corpus of files for the assignment.
Warning: it's quite large (~107 MB), so expect a long download
time. Unzipped, it's about 350 MB, consisting of over 6,100
files. We'll be using this corpus
again in the next assignment. The first link is to the
original corpus (with ":" characters that Windows cannot handle), and the
second to the renamed corpus (with ":" replaced by "zYz").
Note that if you use the second corpus, you should replace
"zYz" with ":" in any output (gold standards and automatic
system retrieval results).
[ 5246corpus.zip ]
[ 5246corpusRenamed.zip ]
- If you do decide to go it alone and build an IR system from
scratch, you will probably want to check the first few chapters of
the Managing Gigabytes textbook, which has technical
details on how to build search engines. At least one copy is on RBR at the Science Library.
[ Check LINC for this book ]
- Another, newer IR library - Terrier - Terabyte
Retriever - from the folks at Glasgow.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Sun Jan 21 16:31:48 2007 | Version: 1.0 | Last modified: Mon Mar 5 17:34:41 2007