Updated: Wed Jan 31 17:03:05 GMT-8 2007: Added new renamed corpus link.
Homework #1 - Search Engine for Scientific Papers
In this assignment, you will develop a search engine for
scientific papers. As in many other domains, the input is
largely free text, but some semi-structure can be recovered from it
if effort is invested. Also, as with other real-world problems, the
input is noisy: the input files have been automatically converted from
PDF to plain text using software that does not do a perfect job
of conversion.
Your assignment is to create a basic search engine that will
retrieve relevant papers from a corpus of about 6,100 such articles in
plain-text format. To do this, you should have access to a computer
with enough storage space for the corpus (about 420 MB
uncompressed). Alternatively, you can use the corpus directly on
sf3/sunfire: after logging in, you can view the documents under
~rpnlpir/public_html/workspace/kanmy/5246/5246corpus.
The individual files are also visible on the web, but be warned:
the directory listing alone is already 400 KB in size.
Note that each file in the corpus has its origin URL
encoded in its filename.
For convenience we have pre-indexed the whole corpus using a
standard IR system, Lucene. You can programmatically call the Lucene
IR engine to return relevant filenames (given a query constructed by
your program). You can then post-process these results to re-rank the
results as needed.
Alternatively, you can choose to work from scratch and develop your own IR
system based on the concepts of weighting and retrieval that we have
discussed in class. This will be more difficult, and as such, will be
graded more leniently than if you use the simpler (but recommended)
method of post-processing the results from Lucene.
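To make the weighting idea concrete, here is a minimal, illustrative sketch of the standard tf-idf scheme (sublinear term frequency with a log-scaled inverse document frequency). The class and method names are hypothetical and nothing here is required code; it is only a sketch of the general idea, assuming you have already tokenized documents into term-frequency maps.

    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {
        // idf(t) = log(N / (1 + df(t))), where N is the number of documents in the corpus.
        static double idf(int numDocs, int docFreq) {
            return Math.log((double) numDocs / (1 + docFreq));
        }

        // score(d, q) = sum over query terms t of (1 + log tf(t,d)) * idf(t)
        static double score(Map<String, Integer> docTermFreqs, List<String> queryTerms,
                            Map<String, Integer> docFreqs, int numDocs) {
            double s = 0.0;
            for (String t : queryTerms) {
                Integer tf = docTermFreqs.get(t);
                Integer df = docFreqs.get(t);
                if (tf != null && tf > 0 && df != null && df > 0) {
                    s += (1.0 + Math.log(tf)) * idf(numDocs, df);
                }
            }
            return s;
        }
    }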
Either way, you will have to come up with a solution that 1)
creates suitable queries given statements of information need, and 2)
deals with the semi-structured and noisy nature of the input.
You can now search and retrieve files from the
Lucene index. I've finished creating a web front end to
the system. You can use it interactively,
or programmatically from the web, e.g., http://www-appn.comp.nus.edu.sg/~rpnlpir/workspace/kanmy/5246/index.cgi?mode=process&q=information+retrieval.
The output format is: first line, the query description; second
line, the number of hits; third and subsequent lines, the raw score from
Lucene and the path to the document.
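If you call the interface programmatically, that three-part output is easy to parse. The sketch below fetches a query result and collects (score, path) pairs for later re-ranking; the URL and line layout are as described above, but the class name is mine and I am assuming the score and path on each result line are separated by whitespace (check the actual output to confirm).

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;

    public class FetchResults {
        public static void main(String[] args) throws Exception {
            String base = "http://www-appn.comp.nus.edu.sg/~rpnlpir/workspace/kanmy/5246/index.cgi?mode=process&q=";
            String query = "information retrieval";                  // example query
            URL url = new URL(base + URLEncoder.encode(query, "UTF-8"));
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String description = in.readLine();                      // line 1: query description
            int numHits = Integer.parseInt(in.readLine().trim());    // line 2: number of hits
            List<String[]> hits = new ArrayList<String[]>();         // each entry: {score, path}
            String line;
            while ((line = in.readLine()) != null) {                 // remaining lines: score and path
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length == 2) hits.add(parts);
            }
            in.close();
            System.out.println(description + ": " + numHits + " hits, " + hits.size() + " parsed");
            // hits can now be re-ranked with your own features before writing your results.
        }
    }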
If you want to use your account on sf3 to do the assignment,
please do. You have two choices:
- You can call the Java application that I demonstrated in
class to get retrieval results. On sf3 you can run the
following command:
/usr/local/bin/java -cp /home/rsch/rpnlpir/public_html/workspace/kanmy/5246/lucene-core-2.0.1-dev.jar:/home/rsch/rpnlpir/public_html/workspace/kanmy/5246/lucene-demos-2.0.1-dev.jar org.apache.lucene.demo.SearchFiles -index /home/rsch/rpnlpir/public_html/workspace/kanmy/5246/index -query tfidf
to get the output of Lucene for a specific query (here, "tfidf").
The output format is the same as in the web interface.
- You can reference the Lucene index directly, if you want
to use Lucene's API (a minimal sketch follows this list). The index has
a single indexed, unstored field called "contents". You can see the
sample code used to create the sample application (which is the same as
used by the web interface) in the SearchFiles.java code in the same
directory. Note that this SearchFiles.java is a slightly hacked-up
version of the org.apache.lucene.demo.SearchFiles class shipped with the
Lucene 2.0.0 distribution, modified to allow command-line queries (they
are not the same!).
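For reference, below is a minimal sketch of querying the index through the Lucene 2.0 API. The index path and the "contents" field come from the description above; the stored "path" field is my assumption (it is what the Lucene demo indexer stores), so check SearchFiles.java in the same directory for the actual field names.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class QueryIndex {
        public static void main(String[] args) throws Exception {
            String indexDir = "/home/rsch/rpnlpir/public_html/workspace/kanmy/5246/index";
            IndexSearcher searcher = new IndexSearcher(indexDir);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse("tfidf");                 // example query string
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                // "path" is assumed to be a stored field, as in the Lucene demo indexer;
                // verify the field names against SearchFiles.java.
                System.out.println(hits.score(i) + " " + doc.get("path"));
            }
            searcher.close();
        }
    }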
To assess your submissions, we will be using statements of
information need, which your system needs to automatically convert into
queries to find relevant documents. Your retrieval results will be
assessed against answers compiled by the class. Each student
will be assigned two needs for which to find relevant documents in the
corpus; the answer to each need will be a list of documents
that the student compiles. The sum of the query answers will be used
to grade system performance. Below are the 13 information needs that
will be used to test all systems; another 2 have been withheld for
private testing (to be made public only after the submission deadline).
You can download these needs as a zip archive.
Needs will be provided to your system as a 2-line input text file to
be read from standard input, in which the first line is the title of
the query (the bolded title of each need below) and the second line is the
description.
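As a concrete starting point, here is a small sketch of reading such a 2-line need from standard input. The class name is mine, and the query construction shown (just reusing the title) is deliberately naive; your system should do something smarter with both fields.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class ReadNeed {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String title = in.readLine();          // line 1: the need's title
            String description = in.readLine();    // line 2: the longer description
            // Naive baseline: use the title words, whitespace-normalized, as the query.
            String query = title.trim().replaceAll("\\s+", " ");
            System.out.println("Query: " + query);
        }
    }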
- Languages for Object Modeling:
A relevant document will discuss standardized symbols and grammars for
languages used to describe objects. These include, but are not limited
to, applications in software engineering and system design.
- 3D Modeling:
A relevant document will describe methods to efficiently compute and
construct 3D models from automatically acquired data (e.g.,
rangefinder data). Documents that discuss modeling from manually
entered clean data are not relevant.
- Medical image analysis:
A relevant document will describe theory and/or algorithms for
analyzing medical images. Visualization of medical data is not
relevant to this topic. Performance measures with respect to a test
set should be included if a standard data set is used.
- Tutorials and Surveys:
A relevant paper will be a simple introduction to a technical computer
science topic, explaining the topic at a high-level, easy enough for
non-specialists to understand.
- Dimensionality Reduction:
Papers that discuss any form of dimensionality reduction as their core
topic are relevant. These include Latent Semantic Indexing, Singular Value
Decomposition and any form of feature selection.
- Abstracts, Panels and Demos:
A relevant paper will not include technical details of the work but
provide only a short abstract of the speaker's topic, panel discussion
or system demonstration.
- Upper bound estimation:
Papers that establish a theoretical upper bound on complexity in terms
of either space or time for a particular problem are relevant.
- Decision Tree Use:
Papers that describe some application of the machine learning method
of decision trees are relevant. Papers that only use decision trees
for performance comparison (i.e., as a baseline) are not considered
relevant.
- Large-Scale Digital Library Implementations:
Papers that discuss, use or touch upon any large scale digital library
or search engine system are considered relevant.
- Papers from the UK:
Any papers authored by people who work or are affiliated with
institutions in the United Kingdom are considered relevant.
- Datasets for text classification:
Relevant papers will mention at least one dataset primarily used for
supervised text classification (e.g., Reuters 21578).
- Question Answering:
A relevant paper will describe an approach to question answering.
Papers solely on document retrieval or passage retrieval are not
considered relevant.
- Weighting schemes for retrieval:
Relevant papers will describe either a set of weighting schemes or a
particular weighting scheme developed for text, image, or video
retrieval.
Note that since this assignment comprises at least 25%
of your grade, I expect a commensurate level of effort. You have five
weeks to do this assignment. You should start
immediately by finishing your judgments of which documents are
relevant to which information needs. Hopefully this will give you an
idea of how to code your search engine, which you can then build on to
complete the assignment.
What to turn in
You will upload an X.zip archive (where X is your matric ID, with all
letters in uppercase) by the due date, consisting of the
following four sets of items. Note that, for grading purposes, I do not
want to know who you are, so it is important
that you try not to reveal your identity in your submission. Please
follow the instructions below to the letter.
- A summary file in plain text (not MS Word, not OpenOffice),
giving your matric number and your NUS (u|g)-prefixed email address
(as the only forms of ID), and describing your submission and the
architecture for retrieval. In this file you also need to describe
how your source code can be built and executed on sf3/sunfire
(filename: ReadmeX.txt, where X is your matric ID). You should include
notes about the development of your submission, and any special features
that you developed to handle the structure of the queries and
documents. If you decided to create your own IR library or to use an IR
library other than the Lucene web interface that is provided, you'll
need to provide information on how I should index the collection, or
a URL from which to download the indices that you used in your experiments.
Warning! If you use any resources, code or descriptions of
search engines beyond the references on this page, you need
to give proper credit and acknowledge the contribution of others.
Please cite or acknowledge work that helped you that you did not do on
your own. I will deduct credit accordingly if needed. Failure to
acknowledge your sources constitutes plagiarism and will be punished
accordingly.
- Two gold-standard lists of relevant documents, one for each of the
two needs you were assigned to find relevant documents for. You should
assess relevance only on the basis of the text file you have to read
(there will be lots of noise); please do not consult the original PDF
file that can be reconstructed from the encoded URL in the filename.
Each file should list the information need ID on the first line, and on
each subsequent line a relevance judgement (+ or -) and a filename,
separated by a space; see this example file. You should list at least
fifty documents, and annotate more relevant documents if possible.
These files should be named nX-gold.txt, where X should be replaced by
the need ID.
- Fifteen files for the retrieval results for all 15 training
queries. These should be in a form similar to the gold-standard
files: the need ID on the first line and the filenames of relevant
documents (in relevance order) on the subsequent lines. These files
should be named nX.txt, where X should be replaced by the need ID. A
sample file is here.
- Your source code tree. This should be relatively well
documented so that I can follow the logic of your code, with the help
of the ReadmeX.txt file.
Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your
submission. Do not include a directory in the submission to extract to
(e.g., unzipping X.zip should give files like X.sum, not X/X.sum or
submission/X.sum). Please use all capital letters when writing your
matric number (matric numbers should start with U, NT, HT or HD for
all students in this class). Your cooperation with the submission
format will allow me to grade the assignment in a timely manner.
Grading scheme
Your grade will take into account 1) features used, 2) retrieval
accuracy, 3) peer annotation, 4) documentation and 5) time efficiency.
These factors are listed in order of importance/weighting to your
final grade for the assignment. A bonus/additional component will be
added to your grade in case you decide to create your own IR library.
Warning -- I will be reading your code, so please make sure it is tidy
and well documented.
- Features used. This will be judged on the basis of your code
and your summary file: what features you use, whether you
take advantage of the semi-structure in the input, and how you
modified the ranking score to get the final results.
- Retrieval accuracy. This will be judged based on the pooled
relevance judgments that all students turn in (the nX-gold.txt
files in your submissions). I will also
include some additional test queries that you will not know
ahead of time.
- Peer Annotation. To judge #2 (retrieval accuracy), I will be
looking at your annotated results to check for completeness and
good manual retrieval. You can (and should) use both the
Lucene IR interface (below) as well as other sources (CiteSeer,
Google Scholar, DBLP) to assess whether your results are
complete. Note that since our corpus is only a tiny fraction of all
scholarly documents on the web, there will be lots of relevant
papers not found in our collection; these you do not have to
worry about.
- Documentation. How well the summary file and source code are
documented. This will include how easy it is for me to run
your software and the state of your code (is it readable, and
is the workflow well partitioned?). In your assignment submission,
please do not assume that any environment variables (e.g., PATH and
CLASSPATH) are necessarily set correctly.
- Time efficiency of the system. As long as the system takes no
longer than 30 seconds to produce a result for a need, it will be
considered satisfactory. Students who choose to build their
own IR library will be allowed up to 3 minutes to produce
results for each need.
Due date and late policy
According to the syllabus, this homework is due on 26 Feb at 11:59
pm SGT. Submit your zip file to the IVLE workbin by
this time. The late policy set forth on the "Grading" page applies.
For those of you who are doing the homework but need to demo, you can
sign up for a demo slot, or if none of the slots work for
you, you can arrange to demo separately with me. Note that students
who need to demo still have to submit on time.
References
- Apache Lucene - The most
widely used, open-source IR library. I have indexed the input
collection using this library. You can programmatically use
this library to retrieve results from the collection which you
can then post-process.
- Here is a link to the corpus of files for the assignment.
Warning: it's quite large (~107 MB), so expect a long download
time. Unzipped, it's about 350 MB, consisting of over 6,100
files. We'll be using this corpus
again in the next assignment. The first link is to the
original corpus (with ":" characters that Windows cannot handle), and the
second to the renamed corpus (with ":" replaced by "zYz").
Note that if you use the second corpus, you should replace
"zYz" with ":" in any output (gold standards and automatic
system retrieval results).
[ 5246corpus.zip ]
[ 5246corpusRenamed.zip ]
- If you do decide to go it alone and build an IR system from
scratch, you will probably want to check the first few chapters of
the Managing Gigabytes textbook, which has technical
details on how to build search engines. At least one copy is on RBR at the Science Library.
[ Check LINC for this book ]
- Another, newer IR library - Terrier - Terabyte
Retriever - from the folks at Glasgow.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Sun Jan 21 16:31:48 2007 | Version: 1.0 | Last modified: Mon Mar 5 17:34:41 2007