CS 5246 - Text Processing and the Web

Updated on Tue Mar 20 01:34:27 GMT-8 2007 - Corrected what items to turn in.

Homework #2 - Single Document Summarization for Scientific Papers

In this assignment, you will develop a generic (i.e., non-query-based) single-document summarization system for the same scientific documents in the CS 5246 scientific document collection. As you are dealing with the same input corpus, the same notes apply: semi-structure can often be recovered, and the input is noisy.

You will have to decide whether you want to construct an extractive or an abstractive summarization system. The evaluation criteria differ between the two types of systems; see the grading scheme below.

Note that the textual input to your system will be one of the text files in the 5246 corpus. However, unlike the original documents, the input fed to your system will NOT include any abstracts, keywords and/or general terms.

Thanks to your midterm feedback, this homework may be done in pairs or individually. There will be no differentiation in grading for assignments done in pairs or individually. Note that if you do this assignment as a team, you are responsible for making sure the workload is balanced between both members. I will not be involved in balancing workload between team members. Please read the notes on grading again if you have any concerns. If you do the assignment as a team, you should join both of your matric numbers with an underscore ('_') in your submission.

What to turn in

You will upload an X.zip archive (where X is your matric ID, with all letters in uppercase) by the due date, consisting of the following items. Note that I do not want to know who you are when grading assignments, so it is important that you try not to reveal your identity in your submission. Please follow the instructions below to the letter.

A summary file in plain text (not MS Word, not OpenOffice), named ReadmeX.txt, where X is your matric ID. It should give your matric number and your NUS (u|g)-prefixed email address as the only form of ID, and describe your submission and the architecture of your system. In this file you also need to describe how your source code can be built and executed on sf3/sunfire. You should include notes about the development of your submission, whether your system is abstractive or extractive, and any special features that you developed to handle the structure of the documents.
The code for your system: tested, compilable and runnable on sf3/sunfire, which is where I will run your code. Your code should read a file from standard input (which will be a file from the CS 5246 corpus, supplied by "cat <filename> | yourProgram") and produce the summary on standard output. Note that you may open temporary files in /tmp, and may assume that only one instance of your code will be executing at any time. As the assignments will be tested and run on sf3/sunfire, you may choose to interface with other common tools or libraries on sf3/sunfire, as per Assignment #1.
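The standard-input/standard-output contract above can be sketched as follows. This is only a minimal illustration in Python, not a required design: the summarize function, its max_sentences parameter and the "first few lines" baseline are hypothetical placeholders, and it is an assumption that a suitable Python interpreter is available on sf3/sunfire.

```python
import sys


def summarize(text, max_sentences=5):
    """Hypothetical placeholder: return the first few non-empty lines.

    A real system would replace this with its own extractive or
    abstractive logic.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines[:max_sentences])


def main():
    # Read raw bytes so that noisy input cannot trigger a decoding error;
    # undecodable bytes are replaced instead of aborting the run.
    raw = sys.stdin.buffer.read()
    text = raw.decode("utf-8", errors="replace")
    # The summary goes to standard output, and nothing else does.
    sys.stdout.write(summarize(text) + "\n")


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as summarize.py, it would be invoked as "cat <filename> | python summarize.py". Any diagnostics should go to standard error so that standard output contains only the summary.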

Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your submission. Do not include a directory in the submission to extract to (e.g., unzipping X.zip should give files like X.txt, not X/X.txt or submission/X.txt). Please use all capital letters when writing your matric number (matric numbers should start with U, NT, HT or HD for all students in this class). Your cooperation with the submission format will allow me to grade the assignment in a timely manner.
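One way to get a flat archive with an uppercase name is sketched below, using Python's standard zipfile module. The function name build_submission and its parameters are hypothetical; any ZIP utility that stores the files at the archive root works just as well.

```python
import os
import zipfile


def build_submission(matric_id, file_paths, out_dir="."):
    """Write <MATRIC_ID>.zip with every file stored at the archive root."""
    archive_path = os.path.join(out_dir, matric_id.upper() + ".zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            # arcname drops leading directories, so unzipping never
            # creates a subfolder such as X/ or submission/.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

Listing the archive afterwards (for example with "unzip -l") is a quick check that no entry contains a directory prefix.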

Grading scheme

Your grade will take into account 1) features used, 2) summary quality, 3) documentation and 4) time efficiency. These factors are listed in order of importance/weighting to your final grade for the assignment. Warning -- I will be reading your code, so please make sure it is tidy and well documented.

Features used. This will be judged on the basis of your code and your summary file. What features do you use, whether you take advantage of the semi-structure in the input, how you modified
Corrected. Summary quality. Abstractive systems will be graded solely on the quality of the summary, as judged by my reading of it; extractive systems will be graded partly by this method and partly by automated unigram overlap scoring (ROUGE).
Documentation. How well the Readme file and source code are documented. This will include how easy it is for me to run your software and the state of your code (is it readable, and is the workflow well partitioned?). In your assignment submission, please do not assume that any environment variables (e.g., PATH and CLASSPATH) are necessarily correctly set.
Time efficiency of the system. As long as the system takes no longer than 30 seconds to produce a summary for the average document of 3000-6000 words, it will be considered satisfactory. Note that some of the documents in the corpus are quite long (examples are lengthy technical reports or Ph.D. theses), and your system will not be asked to summarize these types of documents.
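To give a feel for what unigram overlap scoring measures, here is a rough sketch of a ROUGE-1-style computation in Python. It is not the official ROUGE script and will not reproduce its numbers: the tokenization (lowercased alphanumeric runs) and the function names unigrams and rouge1 are assumptions made for illustration.

```python
import re
from collections import Counter


def unigrams(text):
    """Count lowercased alphanumeric tokens (a simplifying assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1(candidate, reference):
    """Return (precision, recall, f1) for clipped unigram overlap."""
    cand, ref = unigrams(candidate), unigrams(reference)
    # Counter intersection keeps the minimum count per token, so a word
    # repeated in the candidate is credited at most as often as it
    # appears in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" gives a precision of 1.0 and a recall of 0.5, since three of the six reference unigrams are covered.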

Due date and late policy

According to the syllabus, this homework is due on 9 Apr at 11:59 pm SGT. The late policy for submissions applies as per the policy set forth on the "Grading" page.

References

Apache Lucene - The most widely used, open-source IR library. I have indexed the input collection using this library. You can programmatically use this library to retrieve results from the collection which you can then post-process.
Here is a link to the corpus of files for the assignment. It is the same as the one used for Assignment #1. Warning: it's quite large (~107 MB), so expect a long download time. Unzipped, it's about 350 MB, consisting of over 6100 files. We'll be using this corpus again in the next assignment. Note that there are problems unzipping under Windows because ":" is not allowed in filenames there.
[ 5246corpus.zip ]
You may want to use a sentence splitting tool, especially if you choose to do extractive summarization. You can choose to write your own or use a tool such as MXTerminator, a package with a Java interface. MXTerminator is installed on sf3/sunfire as part of the rpnlpir group research account, available to all on sf3/sunfire.
You may want to use a machine learning package to have your tool learn some rules from your annotated data, if any. You can try the WEKA (good for Java heads) or SVMlight (good for command line users) packages. Both are pretty easy to use.
To evaluate (extractive) summaries you can use the ROUGE script. Here is the description. There's a working copy of it in the rpnlpir account on sf3, under the path ~rpnlpir/tools/evalTools/rouge/RELEASE-1.5.5. ROUGE is a single-file Perl script (you'll need Perl to run it). Note that you cannot use this tool outside of SoC and class; it is protected by licensing agreements. You can sign your own licensing agreement if you want to use it outside of class.
You may want to look at the MEAD summarization toolkit if you find sentence-based extractive summarization your cup of tea for the homework assignment.
I've coded a simple plain text section finder in perl that will attempt to separate the abstract, references, the main body of text and anything else into separate files. You can use this sectionSeparator.pl script if you'd like. I have used it to create some sample input and output files for you (*.main_body.txt and *.abstract.txt). Right click to save and then you'll have to make it executable.
Download some sample abstracts and their corresponding main body texts from the 5246 corpus. 5246sectioned.zip or with ":" replaced by "zYz" 5246sectionedRenamed.zip
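For those who would rather write their own sentence splitter than use one of the tools listed above, the following Python sketch shows a naive regex-based splitter together with a simple frequency-based sentence ranker. Both are assumptions for illustration only: the function names split_sentences and top_sentences are hypothetical, the boundary rule will mis-split abbreviations such as "et al." or "Fig.", and average word frequency is just one baseline scoring feature, not a recommended one.

```python
import re
from collections import Counter

# Naive boundary: '.', '!' or '?' followed by whitespace and then an
# uppercase letter or a digit. Abbreviations will be split wrongly.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text):
    """Collapse whitespace, then split at the naive boundaries."""
    flat = " ".join(text.split())
    return [s for s in _BOUNDARY.split(flat) if s]


def top_sentences(text, k=3):
    """Pick the k sentences with the highest average word frequency.

    The selected sentences are returned in their original document order.
    """
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

The sample *.main_body.txt files are a convenient input for trying such a splitter, and the matching *.abstract.txt files can serve as reference summaries when inspecting the output.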

Note / Warning: If you use any of these resources (especially software), you'll have to cite it and be explicit about what you did to change it or customize it for the task in our assignment. Simply learning how to use a piece of software does not constitute a worthy homework assignment submission.

Min-Yen Kan <kanmy@comp.nus.edu.sg> Created on: Sun Jan 21 16:31:48 2007 | Version: 1.0 | Last modified: Tue Mar 20 01:38:04 2007