Updated on Tue Mar 20 01:34:27 GMT-8 2007
- Corrected what items to turn in.
Homework #2 - Single Document Summarization for Scientific Papers
In this assignment, you will develop a generic (i.e.,
non-query-based) single-document summarization system for the
scientific documents in the cs 5246 scientific document collection
used in Assignment #1. As you are dealing with the same input corpus,
the same notes apply: semi-structure can often be recovered, and
the input is noisy.
You will have to decide whether you want to construct an extractive
or an abstractive summarization system. The evaluation criteria
differ between the two types of systems; see the grading scheme
below.
Note that the textual input to your system will be one of the text
files in the 5246 corpus. However, unlike the original documents, the
input fed to your system will NOT include any abstracts, keywords
and/or general terms.
Thanks to your midterm feedback, this homework may be done in pairs
or individually. There will be no differentiation in grading for
assignments done in pairs or individually. Note that if you do
this assignment as a team, you are responsible for making sure the
workload is balanced between both members. I will not be involved
in balancing workload between team members. Please read the notes on
grading again if you have any concerns. If
you do the assignment as a team, you should concatenate both matric
numbers, joined by an underscore ('_'), in your submission.
What to turn in
You will upload an X.zip archive (where X is your matric ID, with all
letters in uppercase) by the due date, consisting of the
following items. Note that I do not want to know
who you are when grading assignments, so it is important
that you try not to reveal your identity in your submission. Please
follow the instructions below to the letter.
- A summary file in plain text (not MS Word, not OpenOffice),
giving your matric number and your NUS (u|g) prefixed email address
(as the only form of ID), that describes your submission and the
architecture of your summarization system. In this file you also need
to describe how your source code can be built and executed on
sf3/sunfire (filename: ReadmeX.txt, where X is your matric ID). You
should include notes about the development of your submission, whether
your system is abstractive or extractive, and any special features that
you developed to handle the structure of the documents.
- The code for your system: tested, compilable and runnable on
sf3/sunfire, which is where I will run your code. Your code should
read a file from standard input (which will be a file from the cs5246
corpus, supplied via "cat <filename> | yourProgram") and produce the
summary on standard output. Note that you may open temporary files in
/tmp, and may assume that only one instance of your code will be
executing at any time. As the assignments will be tested and run on
sf3/sunfire, you may choose to interface with other common tools or
libraries on sf3/sunfire, as per Assignment #1.
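The standard input / standard output contract described above can be sketched as follows. This is only a minimal illustration in Python, not a recommended design: the function names (split_sentences, summarize, main) are hypothetical, the frequency-based sentence scoring is a simple placeholder for whatever features you actually develop, and whether a suitable Python interpreter is available on sf3/sunfire is an assumption you would need to check.

```python
import re
import sys
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def summarize(text, max_sentences=3):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Document-wide word frequencies serve as a crude importance signal.
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return "\n".join(sentences[i] for i in chosen)


def main(in_stream, out_stream, max_sentences=3):
    """Read the whole document from in_stream, write the summary to out_stream."""
    out_stream.write(summarize(in_stream.read(), max_sentences) + "\n")
```

Ending the file with a call to main(sys.stdin, sys.stdout) would let it be run in the form "cat <filename> | yourProgram". Nothing here deals with the noise or semi-structure of the corpus; that is left to your own system.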
Please use a ZIP (not RAR, BZ2 or TAR) utility to construct your
submission. Do not include a directory in the submission to extract to
(e.g., unzipping X.zip should give files like X.txt, not X/X.txt or
submission/X.txt). Please use all capital letters when writing your
matric number (matric numbers should start with U, NT, HT or HD for
all students in this class). Your cooperation with the submission
format will allow me to grade the assignment in a timely manner.
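One way to check the packaging rules above is to build the archive programmatically. The sketch below uses Python's standard zipfile module. The helper names (submission_id, build_submission) and any matric numbers you pass in are hypothetical, and an ordinary ZIP utility serves equally well as long as the result has no enclosing directory.

```python
import os
import zipfile


def submission_id(matric_numbers):
    """Upper-case each matric number; team members are joined with '_'."""
    return "_".join(number.upper() for number in matric_numbers)


def build_submission(matric_numbers, file_paths, out_dir="."):
    """Write <ID>.zip with every file stored at the root of the archive."""
    zip_path = os.path.join(out_dir, submission_id(matric_numbers) + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            # arcname drops leading directories, so nothing unpacks into X/...
            archive.write(path, arcname=os.path.basename(path))
    return zip_path
```

Listing the archive afterwards (for example with "unzip -l") should show bare file names only. Note that this sketch flattens every file to the archive root, so two source files with the same base name would collide.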
Grading scheme
Your grade will take into account 1) features used, 2) summary
quality, 3) documentation and 4) time efficiency. These factors are
listed in order of importance/weighting to your final grade for the
assignment. Warning -- I will
be reading your code, so please make sure it is tidy and well
documented.
- Features used. This will be judged on the basis of your code
and your summary file: what features you use, whether you
take advantage of the semi-structure in the input, how you
modified
- Corrected. Summary quality. Abstractive
systems will be graded solely on the
quality of the summary, as judged by my reading of it; extractive
systems will be graded partly by this method and partly by
automated unigram overlap scoring (ROUGE).
- Documentation. How well the Readme file and source code are
documented. This will include how easy it is for me to run
your software and the state of your code (is it readable, and
is the workflow well partitioned?). In your assignment submission,
please do not assume that any environment variables (e.g., PATH
and CLASSPATH) are necessarily correctly set.
- Time efficiency of the system. As long as the system takes no
longer than 30 seconds to produce a summary for the average
document of 3000-6000 words, it will be considered
satisfactory. Note that some
of the documents in the corpus are quite long (examples are
lengthy technical reports or Ph.D. theses), and your system
will not be asked to summarize these types of documents.
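To get a rough feel for the unigram overlap scoring mentioned under summary quality, the sketch below computes a simple unigram recall between a system summary and a reference text. It is not the ROUGE script and will not reproduce its numbers: the tokenization is a crude assumption, there is no stemming or stopword handling, and the names (tokenize, unigram_recall) are hypothetical. Which reference text the official scoring uses is not stated here; scoring against a paper's own abstract is an assumption made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate, reference):
    """Fraction of reference unigrams (counted with repeats) found in candidate."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    # Clipped matching: each reference token is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[token])
                  for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

For example, scoring "the cat sat" against "the cat sat on the mat" matches three of the six reference tokens, giving 0.5.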
Due date and late policy
According to the syllabus, this homework is due on 9 Apr at 11:59
pm SGT. The late policy set forth on the "Grading" page applies to
submissions.
References
- Apache Lucene - The most
widely used, open-source IR library. I have indexed the input
collection using this library. You can programmatically use
this library to retrieve results from the collection which you
can then post-process.
- Here is a link to the corpus of files for the assignment. It
is the same as the one used for Assignment #1.
Warning, it's quite large (~107 MB). Expect a long download
time. Unzipped it's about 350 MB, consisting of over 6100
files. We'll be using this corpus
again in the next assignment. Note that there are problems
unzipping under Windows due to restrictions in having ":" in
filenames.
[ 5246corpus.zip ]
- You may want to use a sentence splitting tool, especially if
you choose to do extractive summarization. You can choose to
write your own or use a tool such as MXTERMINATOR,
a package with a Java interface. MXTerminator is installed on
sf3/sunfire as part of the rpnlpir group research account, available
to all on sf3/sunfire.
- You may want to use a machine learning
package to have your tool learn some rules from your annotated
data, if any. You can try the WEKA (good for
java heads) or SVMlight (good for
command line users) packages. Both
are pretty easy to use.
- To evaluate (extractive) summaries you can use
the ROUGE script. Here is
the description. There's a working copy of it in the rpnlpir
account on sf3, under the path
~rpnlpir/tools/evalTools/rouge/RELEASE-1.5.5. ROUGE is a single-file
perl script (you'll need perl to run it). Note that you cannot use
this tool outside of SoC and class, as it is protected by licensing
agreements; you can sign your own licensing agreement if you want to
use it outside of class.
- You may want to look at the MEAD summarization
toolkit if you find sentence-based extractive summarization your
cup of tea for the homework assignment.
- I've coded a simple plain text section finder in perl that will
attempt to separate the abstract, references, the main body of text
and anything else into separate files. You can use this
sectionSeparator.pl script if you'd like. I have used it to create
some sample input and output files for you (*.main_body.txt and
*.abstract.txt). Right click to save and then you'll have to make it
executable.
- Download some sample abstracts and their
corresponding main body texts from the 5246 corpus.
5246sectioned.zip
or with ":" replaced by "zYz" 5246sectionedRenamed.zip
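If the ":" characters in the corpus filenames cause trouble when unzipping under Windows, the same "zYz" substitution used for the renamed archive above can be applied while extracting. The sketch below is only one possible approach, built on Python's standard zipfile module; the function names (safe_name, extract_with_safe_names) are hypothetical and the internal layout of the corpus archive is not assumed.

```python
import os
import zipfile


def safe_name(name, marker="zYz"):
    """Replace ':' (not allowed in Windows filenames) with a marker string."""
    return name.replace(":", marker)


def extract_with_safe_names(zip_path, dest_dir, marker="zYz"):
    """Extract every file, renaming path components that contain ':'."""
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Drop empty, '.' and '..' components so files stay inside dest_dir.
            parts = [safe_name(part, marker)
                     for part in info.filename.split("/")
                     if part not in ("", ".", "..")]
            if not parts:
                continue
            target = os.path.join(dest_dir, *parts)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(target)
    return written
```

Reversing the mapping with name.replace("zYz", ":") recovers the original filename, provided no original name already contains the marker.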
Note / Warning: If you use any of these resources (especially
software), you'll have to cite it and be explicit about what you did
to change it or customize it for the task in our assignment. Simply
learning how to use a piece of software does not constitute a worthy
homework assignment submission.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Sun Jan 21 16:31:48 2007 | Version: 1.0 | Last modified: Tue Mar 20 01:38:04 2007