(This homework assignment was recently revised - Sun Sep 28 16:34:44 SGT 2008)
Homework #1 - Blog Search
In this assignment, you will be developing a search engine for blogs.
Like many inputs in other domains, the input is largely free text, but
semi-structured hints can be recovered from the input with some effort.
Also, similar to other real-world problems, the input is quite noisy;
the input files are HTML files that need to be post-processed to find
the appropriate content.
Your assignment is to create an advanced search engine that will
retrieve relevant English blog posts and comments on particular
topics. To make the assignment more closed in nature, we are
restricting the possible blog posts to ones published in 2007 and
published using Wordpress - a popular blogging software.
To do this assignment, you are to utilize Yahoo!'s Build your Own
Search Service (BOSS). Note: If you have access to other search
engine APIs (Google or Microsoft), you are not to use these APIs -- use
only the BOSS API. This is to guarantee that your system does not
retrieve "better" results merely because it is connected to
a different search engine.
You may work in teams of two or individually for this assignment.
Scores will not be adjusted according to whether the
assignment is done in a team or individually.
BOSS is a very simple system that allows you to gain programmatic
access to the Yahoo! search engine via a simple HTTP GET. Please read
the documentation for more details (see "links" below). For
example, to search for blog posts containing "Google" among
Wordpress blog posts from 2007, we write a URL like:
http://boss.yahooapis.com/ysearch/web/v1/Google%20inurl:2007%20%22powered%20by%20wordpress%22?appid=appid
A few things to note about the above. Spaces and quotation marks
are URL escaped (to "%20" and "%22" respectively). If you include
other special characters you will have to escape them. Secondly, I've
added inurl:2007 "powered by wordpress" as part of the
query to limit my query to HTML documents that have the quoted phrase on
the page and have "2007" as part of the URL. These restrictions limit
the returns to blog entries in 2007 that are created by WordPress
(yes, this doesn't include all such blogs -- some sites that use
WordPress eliminate the "powered by wordpress" tagline). Finally, the
value of the appid attribute must be replaced by a valid appid. To
get a valid appid, you need to follow the instructions to obtain a
BOSS API key, as discussed in class. The output (shown here in XML
format) will contain results as would be shown on Yahoo!, which
include the snippet/abstract, date, size, title and various forms of
the URL (see the BOSS documentation).
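The escaping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint and the shape of the URL are copied from the example URL above, while the function and constant names are hypothetical and not part of BOSS.

```python
from urllib.parse import quote

# Endpoint copied from the example URL in this handout.
BOSS_WEB_ENDPOINT = "http://boss.yahooapis.com/ysearch/web/v1/"
# Modifiers that restrict results to 2007 WordPress posts.
REQUIRED_MODIFIERS = 'inurl:2007 "powered by wordpress"'


def build_boss_url(terms: str, appid: str) -> str:
    """Return a BOSS web-search URL for `terms` plus the required modifiers.

    `build_boss_url` is a hypothetical helper name, not part of any API.
    """
    query = f"{terms} {REQUIRED_MODIFIERS}"
    # Keep ':' literal, as in the example URL. Everything else that is not
    # URL-safe is escaped, including '/', because the query is part of the path.
    escaped = quote(query, safe=":")
    return f"{BOSS_WEB_ENDPOINT}{escaped}?appid={quote(appid, safe='')}"
```

Calling build_boss_url("Google", "appid") reproduces the example URL above; other special characters in the terms (such as "+" or "/") are percent-escaped the same way.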
To do this assignment, you will have to come up with a programmatic
solution that 1) creates suitable queries given statements of
information need, 2) caters to the semi-structured and noisy nature of
blogs.
To assess your submissions, we will be using statements of
information needs which your system should automatically convert to
queries to find relevant documents. Your retrieval results will be
assessed against answers that are compiled by the class. Each student
or team will be assigned two needs to find relevant blog posts for in
the corpus. The answers to these needs will each be a list of
documents that each student compiles. The sum of the query answers
will be used to grade system performance. Below are the 15
information needs that will be used to test all systems; another 2
have been withheld for private testing (to be only made public after
the submission time). You can download these needs as a zip archive
(v2.0; revised on Sun Sep 28 16:31:10 SGT 2008). Needs will be
provided to your system as a 2-line input text file to be read from
standard input (not to be confused with a command-line argument), in
which the first line is the title of the query (e.g., the bolded part)
and the second line is the description. Note that since you should
keep your BOSS API key to yourself, your program must read the BOSS
API key from a specific file (named "boss.key" in the top level of
your submission directory).
- Virginia Tech Shootings: Relevant documents will express
opinions about the shootings and/or hypothesize on the motives
of the killer. Factual reports of the shootings are not
considered relevant.
- Traditional Chinese Medicine: Relevant documents will
discuss opinions or examples of where any form of traditional
Chinese medicine works or fails.
- Technologies for Solving Global Warming: Documents that
name and discuss different technologies or products that may
contribute to the reduction of global warming are considered
relevant.
- Who Wants To Be a Millionaire Hosts: Documents that give
an opinion about the accent, appearance or attitude of the host
of the program are considered relevant. Note that this show is
syndicated and is hosted by different hosts in different
countries.
- UCAS Application Process: Documents that give an opinion
about the UK higher education admissions process are relevant.
Posts that discuss how the system may be improved are also
relevant. Stories about individuals' experiences are also
relevant. Factual pages that discuss the process from
administrative posts are not considered relevant.
- Water purification: Documents that discuss different
methods of water purification and treatment at the industrial
level (not consumer level) are considered relevant.
- Currency Exchange Rates: Documents that discuss whether exchange
rates between any two currencies will rise or fall are
relevant. Posts that discuss only historical fluctuations are
not relevant.
- Best Games for the Nintendo Wii: Documents that give the
authors' or commenters' opinions of their favorite games for
the Wii are considered relevant. Posts that give factual
information on sales rankings or rankings on a particular web
site are not relevant.
- Surfing sites in Australia: Documents that discuss
opinions on different sites for surfing in Australia are
considered relevant. Sites for other sea sports such as diving
and sailing are not considered relevant.
- Saddam Hussein: Documents that discuss the former Iraqi
president's role in the fate of his country and countrymen are
relevant. Documents that discuss opinions on his execution are
also relevant.
- Halo 3: Documents that discuss the game or its beta
version are considered relevant. Documents that primarily
discuss previous installments of this game are
irrelevant.
- Hawaii Sights: Documents that
discuss different tourists' opinions of any of the Hawaiian
islands' sights and sounds are considered relevant. Hotel and
restaurant recommendations by tourists by themselves are not
relevant.
- Airline Frequent Flier Programmes: Documents that
discuss the different benefits and restrictions of different
airline companies' frequent flier programs are relevant.
Documents where the poster just states that the poster belongs
to specific programme(s) are not relevant.
- Phones for SMS: Documents that discuss which mobile
phones are best for sending short messages are considered
relevant. Documents that just describe other aspects of a
mobile phone are considered irrelevant.
- Republican Nominations: Documents that discuss the
chances of possible candidates for the Republican nomination
for the US presidential race are relevant. Documents that
discuss congressional candidates or Democratic candidates are
not relevant.
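The input convention described before this list (a two-line need on standard input, and the API key in "boss.key") might be handled as in the sketch below. The function names are hypothetical, and the sketch assumes the description really does fit on the second line, as the handout states.

```python
from pathlib import Path
from typing import TextIO, Tuple


def read_need(stream: TextIO) -> Tuple[str, str]:
    """Read a two-line need: title on line 1, description on line 2."""
    title = stream.readline().strip()
    description = stream.readline().strip()
    if not title:
        raise ValueError("expected the need title on the first input line")
    return title, description


def read_api_key(path: str = "boss.key") -> str:
    """Return the BOSS API key stored in `path`, without surrounding whitespace.

    The default path is relative, so the program must be started from the
    top level of the submission directory.
    """
    return Path(path).read_text(encoding="utf-8").strip()
```

In the real program, read_need would be called with sys.stdin, so that the system can be run as "runHT0000000 < need.txt" rather than with the need passed as a command-line argument.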
Note that since this assignment comprises at least 25% of your
grade, I expect the level of effort you put into it to be
commensurate. You have five weeks to do this assignment. You should
start immediately by finishing your judgments of which documents are
relevant to which information needs. Hopefully this will give you an
idea of how to code your search engine; you can then go on to
complete the assignment.
What to turn in
You will upload an HT0000000.zip (where HT0000000 is your matric
ID, where all letters are in uppercase) archive by the due date,
consisting of the following four sets of items. Please use a ZIP (not
RAR, BZ2 or TAR) utility to construct your submission. Do not include
a subdirectory for the submission to extract to (e.g., unzipping
X.zip should give files like X.sum, not X/X.sum or
submission/X.sum). Please use all capital letters when writing your
matric number (matric numbers should start with U, NT, HT or HD for
all students in this class). Your cooperation with the submission
format will allow me to grade the assignment in a timely manner. Note
that I do not want to know who you are, with respect to grading
assignments, so it is important that you try not to reveal your
identity in your submission. Please follow the below instructions to
the letter.
- A summary file in plain text (not MS Word, not OpenOffice),
that describes your submission and the architecture for retrieval.
You should include your matric number and your NUS (u|g) prefixed
email address as the only form of ID. In this file you also need to
describe how your source code can be built and executed on
sf3/sunfire. If your submission cannot be run on sunfire, you'll
need to demonstrate it to me, sometime soon after the submission
date (by downloading your submission file and running it on your
system). The link to the demonstration sign-up is here;
demonstrations will be from 5-8pm on 7 Oct. You should
include notes about the development of your submission, and special
features that you developed to handle the structure of the queries
and documents (filename: ReadmeHT0000000.txt, where HT0000000 is
your matric ID). Warning! If you use any lexicons,
resources, code or algorithmic description that are beyond the
references on this page, you need to give proper credit and
acknowledge the contribution of others. Please cite or acknowledge
work that helped you that you did not do on your own. I will deduct
the credit accordingly, if applicable. Failure to acknowledge your
sources constitutes plagiarism and will be punished accordingly.
- Two gold-standard lists of relevant documents, one for each of the
two needs you were assigned to find relevant documents for. You should
assess relevance only on the basis of the HTML file.
Each list should give the information need ID on the first
line and the relevance judgement (+ or -) and URL of each
judged document on the subsequent lines. The judgement and the URL
should be separated by a space; see this example
file. You should list at least fifty documents, and more
relevant documents should be annotated if possible. These should be
named nX-gold.txt, where X should be replaced by the
need ID. Documents that are not in English should be judged as
irrelevant.
- Fifteen files for the retrieval results for all 15 public
queries. These should be in a similar form to the gold-standard
files: the need ID on the first line and the URLs of relevant
documents (in relevance order) on the subsequent lines. These files
should be named nX.txt, where X should be replaced by the need ID. A
sample file is here. Each list should have fifty
results. I will generate the final two files for the test
queries during testing or have you generate them on the fly if a
demo is necessary.
- Your source code tree. This should be relatively well
documented so that I can follow the logic of your code, with the
help of the ReadmeHT0000000.txt file. Typing in "make"
or "ant" should build the appropriate code, such as an executable,
if needed. In your assignment submission, please do not assume that
any environment variables (e.g., PATH and CLASSPATH) are necessarily
correctly set. The executable file to run your system should be
named runHT0000000 (where HT0000000 is to be replaced
by your matric number, as above) and be set as executable (by you or
by your buildfile if it is compiled). In retrieving candidate blog
posts for your system to filter or rerank, you are required to add
the inurl:2007 and "powered by wordpress" modifiers to
your Yahoo! BOSS queries.
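The two list formats described above (gold-standard judgements and retrieval results) can be written and read with very little code. The sketch below is one possible reading of the description: it assumes the need ID is written verbatim on the first line and that each judgement line is a "+" or "-", a single space, then the URL. The linked example files remain the authority on the exact format, and the function names are hypothetical.

```python
from typing import List, Tuple


def format_results(need_id: str, urls: List[str], limit: int = 50) -> str:
    """Need ID on line 1, then up to `limit` URLs, most relevant first."""
    return "\n".join([need_id, *urls[:limit]]) + "\n"


def parse_gold(text: str) -> Tuple[str, List[Tuple[bool, str]]]:
    """Parse a gold list into (need_id, [(is_relevant, url), ...])."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty gold file")
    need_id = lines[0]
    judged: List[Tuple[bool, str]] = []
    for line in lines[1:]:
        # Assumed layout: judgement mark, one space, URL.
        mark, _, url = line.partition(" ")
        if mark not in ("+", "-") or not url:
            raise ValueError(f"malformed judgement line: {line!r}")
        judged.append((mark == "+", url.strip()))
    return need_id, judged
```

Parsing your own gold files with the same code you use to write result files is a cheap way to catch formatting slips before submission.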
Grading scheme
Your grade will take into account 1) features used, 2) retrieval
accuracy, 3) peer annotation, 4) documentation and 5) time efficiency.
These factors are listed in order of importance/weighting to your
final grade for the assignment. Warning -- I will be reading your
code, so please make sure it is tidy and well documented.
- [36 percent] Features used. This will be judged on the basis
of your code and your summary file. What features do you use,
whether you take advantage of the semi-structure in the input,
how you modified the ranking score to get the final results.
- [32 percent] Retrieval accuracy. This will be judged based on
the pooled relevance judgments that all students turn in (the
nX-gold.txt files in your submission). I will also
include some additional test queries that you will not know
ahead of time.
- [20 percent] Peer Annotation. To judge #2 (retrieval accuracy)
I will be looking at your annotated results to check for
completeness and good manual retrieval. Note that since our corpus is
only a tiny fraction of all blogs on the web, there will be
lots of relevant posts not found by using our criteria (Yahoo!
BOSS, inurl:2007 and "powered by wordpress"); these you do not
have to worry about.
- [10 percent] Documentation. How well the summary file and
source code are documented. This will include how easy it is
for me to run your software and the state of your code (is it
readable, and is the workflow well partitioned?).
- [2 percent] Time efficiency of the system. As long as the
system takes no longer than 5 minutes to produce a result for a
need, it will be considered satisfactory.
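This handout does not say which measure will be used for retrieval accuracy. For checking your own system against your gold lists, one common choice is precision at k; the sketch below assumes that measure purely for illustration, and the function name is hypothetical.

```python
from typing import List, Set


def precision_at_k(ranked_urls: List[str], relevant: Set[str], k: int = 50) -> float:
    """Fraction of the top-k returned URLs that are judged relevant.

    Assumption: precision at k is used here only as a self-check; the
    actual grading measure is not stated in the handout.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for url in ranked_urls[:k] if url in relevant)
    # Dividing by k (not by the number returned) penalises short lists.
    return hits / k
```

Here `relevant` would be the set of URLs marked "+" in a gold list, and `ranked_urls` the contents of the matching result file in order.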
Due date and late policy
According to the syllabus, this homework is due on 2 Oct at 11:59
pm SGT. Submit your zip file to the IVLE workbin by
this time. The late policy set forth on the "Grading" page applies to
submissions.
References
- The BOSS homepage. Probably not as useful as the forum or the PDF documentation.
- WordPress - the free blogging platform, which we are targeting in our search.
- wget - an open-source command-line URL fetching utility. Also already installed on sunfire. Recommended for interacting with BOSS.
- The Sentiment AI Yahoo! Group, a group of researchers that look at identifying statements of opinion.
- A fairly recent opinion lexicon that you might use in your assignment.
- You might use the Apache Lucene IR engine to process and retrieve locally downloaded documents.
Hints
- The bulk of this assignment is to think about how to best
utilize statements of information need and how to process them.
You'll need to figure out how to decide what parts of the
statements to keep and which to throw away or weight
negatively. You may want to combine the results of several
searches together using your own weighting, or incorporate
external knowledge from lexicons that you've created yourself
or mined from other resources.
- The assignment is also difficult technically as you have to
deal with XML or JSON output formats. Do plan to spend a bit
of time learning how to interpret this output format
programmatically using your preferred programming language.
Note that your programs have to run on sunfire, otherwise you
have to demonstrate that your programs run on a laptop that
only uses open-source software (private proprietary libraries
are prohibited for assignments).
- You can use Yahoo! BOSS to access lots of different types of
searches from Yahoo!, including spelling correction, news and
the general web. You can use such searches to glean auxiliary,
supplemental information which can be used to help you in
ranking candidates or in expanding your search.
- Yahoo! BOSS accepts all of the query syntax in Yahoo!'s web
search. Since most of you count yourselves as savvy web
searchers, you should be able to figure out how to use some of
the more esoteric searches to help you. If you're not so sure,
check here.
- You can use external sources in RPNLPIR (such as lexica like
WordNet or statistics like IDF statistics over the WebBase
corpus) to assist your programs. If you do plan to use
external resources, please be aware that they take time to
compile and preprocess into a useable form for you to take
advantage of.
- You may find that downloading the documents yourself and
processing them is helpful. If you do download documents, please
note the five-minute limit for each query and make sure that your
program doesn't hang if faced with a recalcitrant page
download.
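For the last hint, one way to keep a stalled download from eating the five-minute limit is to combine a per-request timeout with an overall time budget. The sketch below uses only the Python standard library; the function names, the 10-second timeout and the 240-second budget are illustrative assumptions, not requirements of the assignment.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_page(url: str, timeout: float = 10.0) -> bytes:
    """Download one page, giving up if the connection stalls.

    The timeout applies to each blocking socket operation, not to the
    download as a whole, so a slow-but-steady server can still take longer.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_within_budget(
    urls: Iterable[str],
    budget_seconds: float = 240.0,
    fetch: Callable[[str], bytes] = fetch_page,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[bytes]]:
    """Fetch pages in order until the budget runs out; failed pages map to None."""
    deadline = clock() + budget_seconds
    pages: Dict[str, Optional[bytes]] = {}
    for url in urls:
        if clock() >= deadline:
            break  # out of time: rank with whatever has been fetched so far
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError):
            # URLError and socket timeouts are OSError subclasses;
            # ValueError covers malformed URLs.
            pages[url] = None
    return pages
```

Pages that were skipped or failed can still be ranked using the snippet that BOSS returns, so a slow site costs you some evidence rather than the whole run.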
Disclaimer
I'm not affiliated with Yahoo!, WordPress or other search engine
companies nor am I advocating their products. However, as blogs are a
current interest in IR and Yahoo! has an easy-to-use, non-limited API,
I have chosen to use these tools for our assignment.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Sun Jan 21 16:31:48 2007 | Version: 1.0 | Last modified: Thu Oct 2 12:25:31 2008