In our final Homework 4, we will hold an information retrieval contest with real-world documents and queries: the problem of legal case retrieval. As described in lecture, legal retrieval is a domain where structured documents are prevalent, so it serves as a good testbed for a variety of different retrieval approaches.
Competition framework / Leaderboard
The indexing and query commands will use an (almost) identical input format to Homeworks #2 and #3, so that you need not modify any of your code to deal with command line processing. To recap:
Indexing: $ python index.py -i dataset-file -d dictionary-file -p postings-file
Searching: $ python search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results
The differences from Homeworks #2 and #3 are that 1) dataset-file is a csv file containing all the documents to be indexed, and 2) query-file specifies a single query instead of a list of queries.
However, in a significant departure from the previous homeworks, we will be using a legal corpus, provided by Intelllex, a company with origins partially from NUS.
Problem Statement: Given 1) a legal corpus (to be posted in IVLE) as the candidate document collection to retrieve from, and 2) a set of queries, return the list of the IDs of the relevant documents for each query, in sorted order of relevance. Your search engine should return the entire set of relevant documents (don't threshold to the top K relevant documents).
Your system should return the results for the query in query-file on a single line. Separate the IDs of different documents by a single space ' '. Return an empty line if no cases are relevant.
For this assignment, no holds are barred. You may use any type of preprocessing, post-processing, indexing and querying process you wish. You may wish to incorporate or use other python libraries or external resources; however, for python libraries, you'll have to include them with your submission properly -- I will not install new libraries to grade your submissions.
Intelllex, the company we are working with for this contest, is particularly interested in good IR systems for this problem and thus is cooperating with us for this homework assignment. They have provided the corpus (the documents are in the public domain, as is most government information, but the additional structuring the Intelllex team has done is their own work) and relevance judgments for a small number of queries. Teams that do well may be approached by Intelllex to see whether you'd like to work further on your project to help them for pay. Note: Your README may be read by the Intelllex team, but your code will not be given to their team to use; if they are interested in what you have done, you may opt to license your work to them.
The legal cases and the information needs have a particular structure in this task. Let's start with the information needs.
Queries:
In Intelllex's own system, searchers (lawyers or paralegals) use the familiar search bar to issue free text or Boolean queries, such as in the training query q1.txt: quiet phone call, and q2.txt: "fertility treatment" AND damages.
The keywords enclosed in double quotes are meant to be searched as a phrase.
The phrases in the queries are at most 2 or 3 words long, so if you are able to deal with phrasal queries, you can support them using n-word indices or with positional indices. There are no ORs, NOTs or parentheses in the queries issued by us, so you can simplify your query parsing code if you choose.
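Given this simplified grammar (double-quoted phrases, optional ANDs, no OR/NOT or parentheses), a query can be split into phrases and free terms with a few lines of Python. This is only a sketch; the helper name parse_query is our own, not part of the assignment skeleton:

```python
import re

def parse_query(query):
    """Split a raw query into (phrases, terms).

    Hypothetical helper: assumes the simplified grammar described
    above -- only double-quoted phrases and optional ANDs.
    """
    # Pull out quoted phrases first, e.g. "fertility treatment".
    phrases = re.findall(r'"([^"]+)"', query)
    # Remove the phrases and AND connectives; the rest are free terms.
    remainder = re.sub(r'"[^"]+"', ' ', query)
    terms = [t for t in remainder.split() if t != 'AND']
    return phrases, terms
```

For example, parse_query('"fertility treatment" AND damages') yields the phrase "fertility treatment" plus the single term damages, while a free-text query like quiet phone call yields three single terms.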
Query Relevance Assessments:
The query is the first line of the query file. The file also comes
with (very few) relevance judgments, as subsequent lines. Each line
marks a positive (relevant) legal case identified within the corpus.
You should ideally have your system rank documents from the positive list before any other documents. As relevance judgments are expensive (the judgments made available to you were assigned by lawyers), the bulk of the Intelllex corpus was not assessed for relevance. That is, there may be additional relevant documents in the corpus that are not listed. However, your system will be evaluated only on those documents that have been assessed as relevant.
We show the example for the above q1.txt:

quiet phone call
6807771
3992148
4001247

The above indicates that there are 3 documents, with document_ids 6807771, 3992148 and 4001247, that are relevant to the query.
Cases:
The legal cases are given in a csv file. Each case consists of 5 fields in the following format: "document_id","title","content","date_posted","court".
Below are snippets of a document, ID 6807771, a case relevant to the above example query:
"6807771","Burstow R v. Ireland, R v. [1997] UKHL 34","JISCBAILII_CASES_CRIME JISCBAILII_CASES_ENGLISH_LEGAL_SYSTEM Burstow R v. Ireland, R v. [1997] UKHL 34 (24th July, 1997) HOUSE OF LORDS Lord Goff of Chieveley Lord Slynn of Hadley Lord Steyn Lord Hope of Craighead Lord Hutton ... I would therefore answer the certified question in the affirmative and dismiss this appeal also.","1997-07-24 00:00:00","UK House of Lords"
You may choose to index or omit title, court, and date_posted depending on whether you think they are useful for assessing a case's relevance to the query.
More importantly, the content field has much structure itself. You may decide to exploit this structure through preprocessing at indexing time if you think you can capitalize on it. Note that different jurisdictions may differ in formatting, and even individual courts may format cases differently from one another.
As introduced in Week 8, zones are free text areas, usually within a document, that hold some special significance. Fields are more akin to database columns (in a database, we would actually make them columns), in that they take on a specific value from some (possibly infinite) enumerated set of values.
Along with the standard notion of a document as an ordered set of words, handling either / both zones and fields is important for certain aspects of case retrieval.
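One simple way to combine zones, in the spirit of the weighted zone scoring idea from lecture, is to give each zone a weight and credit a document whenever that zone matches a query term. The zone names and weights below are purely illustrative, not prescribed by the assignment:

```python
def weighted_zone_score(query_terms, doc, weights=None):
    """Score a document by summing the weights of zones that match
    at least one query term.

    Minimal sketch of weighted zone scoring; `doc` is assumed to be a
    dict mapping zone name -> zone text, and the weights here are
    illustrative defaults (they sum to 1.0).
    """
    if weights is None:
        weights = {'title': 0.3, 'content': 0.6, 'court': 0.1}
    score = 0.0
    for zone, w in weights.items():
        zone_tokens = set(doc.get(zone, '').lower().split())
        if any(t.lower() in zone_tokens for t in query_terms):
            score += w
    return score
```

In a full system, the per-zone match test would be replaced by a real similarity score (e.g. cosine over a zone-specific index), but the weighting scheme stays the same.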
You might notice that many of the terms used in the text of the legal cases themselves do not overlap with the query terms used. This is known as the anomalous state of knowledge (ASK) problem or vocabulary mismatch, in which the searcher may use terminology that doesn't fit the documents' expression of the same semantics. A simple way that you can deal with the problem is to utilize query expansion (and pseudo relevance feedback).
As mentioned in the lecture, we can use a manually created ontology (e.g., WordNet) or an automatically generated thesaurus (e.g., a co-occurrence thesaurus) to identify related query terms.
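As a toy illustration of the automatic-thesaurus route, two terms can be treated as related if they occur in the same document. This is our own simplification (helper names included); real systems use windowed co-occurrence counts and association weights rather than raw document-level counts:

```python
from collections import Counter, defaultdict

def build_cooccurrence_thesaurus(docs, top_k=2):
    """Map each term to its top_k most frequent co-occurring terms,
    where co-occurrence means appearing in the same document.

    Toy sketch of an automatically generated thesaurus.
    """
    co = defaultdict(Counter)
    for doc in docs:
        terms = set(doc.lower().split())
        for t in terms:
            for u in terms:
                if t != u:
                    co[t][u] += 1
    return {t: [u for u, _ in c.most_common(top_k)] for t, c in co.items()}

def expand_query(terms, thesaurus):
    """Append related terms from the thesaurus to the original query."""
    expanded = list(terms)
    for t in terms:
        for related in thesaurus.get(t, []):
            if related not in expanded:
                expanded.append(related)
    return expanded
```

Expanded terms are usually down-weighted relative to the original query terms, so that expansion broadens recall without swamping precision.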
In addition, we can perform a preliminary round of retrieval on the query terms. We can then assume that the top few documents are relevant and expand the query by 1) using the Rocchio formula, or 2) extracting important terms from these documents and adding them to the query. This is basically pseudo relevance feedback.
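A minimal sketch of the Rocchio-style update described above, representing vectors as term-to-weight dicts and assuming the top-retrieved documents are treated as relevant (no non-relevant set, hence no gamma term):

```python
def rocchio_expand(query_vec, feedback_vecs, alpha=1.0, beta=0.75):
    """Rocchio update with pseudo-relevant documents only:
        q' = alpha * q + (beta / |Dr|) * sum(d for d in Dr)

    Sketch: vectors are dicts mapping term -> weight, and feedback_vecs
    are the vectors of the top-ranked documents from the first round.
    """
    new_vec = {t: alpha * w for t, w in query_vec.items()}
    if not feedback_vecs:
        return new_vec
    scale = beta / len(feedback_vecs)
    for d in feedback_vecs:
        for t, w in d.items():
            new_vec[t] = new_vec.get(t, 0.0) + scale * w
    return new_vec
```

Terms that appear only in the feedback documents enter the query with weight beta times their average feedback weight, which is exactly the "extract important terms and add them" behaviour described above.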
You are to submit README.txt, index.py, search.py, dictionary.txt, postings.txt, and BONUS.docx. Please do not include the legal case corpus.
You are allowed to do this assignment individually or as a team of up to 4 students. There will be no difference in grading criteria if you do the assignment as a large team or individually. For the submission information below, simply replace any mention of a student number with the student numbers concatenated with a separating dash (e.g., A000000X-A000001Y-A000002Z). Please ensure you use the same identifier (student numbers in the same order) in all places that require a student number.
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your student number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
README.txt: this is a text only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your student number and email (we need the email to contact you about your grade; please use your A*******@u.nus.edu address, not your email alias). This is to help you get an objective grade on your assignment, as we won't associate student numbers with student names. You should use the README.txt template given to you in Homework #1 as a start. In particular, you need to assert whether you followed class policy for the assignment or not.

BONUS.docx: this is a Word document that describes the information related to the query expansion techniques you have implemented. You may include tables / diagrams in this document.
These files will need to be suitably zipped in a single file called
<student number>.zip
. Please use a zip archive and
not tar.gz, bzip, rar or cab files. Make sure when the archive unzips
that all of the necessary files are found in a directory called
<student number>
. Upload the resulting zip file to
the IVLE workbin by the due date: 25 April 2018, 11:59:59 pm
SGT. Read the late policy if you're not sure about grade penalties for lateness.
The grading criteria for the assignment is below.
If needed, your search program may create temporary files named temp-*.* in the current directory for query processing.