In our final Homework 4, we will hold an information retrieval contest with real-world documents and queries: the problem of legal case retrieval. As described in lecture, legal retrieval is a setting where structured documents are prevalent, so it serves as a good testbed for a variety of retrieval approaches.
The indexing and query commands will use an (almost) identical input format to Homeworks #2 and #3, so that you need not modify any of your code to deal with command line processing. To recap:
Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file
Searching: $ python search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results
The difference from Homeworks #2 and #3 is that the query-file specifies a single query, and not a list of queries.
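As a rough illustration of the search.py switches recapped above, here is a minimal sketch written for Python 3. It uses the standard argparse module, which is an assumption on my part (your Homework #2 and #3 code may parse switches differently), and the function name parse_search_args and the dest names are hypothetical.

```python
import argparse


def parse_search_args(argv):
    """Parse the four search.py switches from a list of argument strings."""
    parser = argparse.ArgumentParser(description="Legal case search (sketch)")
    parser.add_argument("-d", dest="dictionary_file", required=True)
    parser.add_argument("-p", dest="postings_file", required=True)
    parser.add_argument("-q", dest="query_file", required=True)
    parser.add_argument("-o", dest="output_file", required=True)
    return parser.parse_args(argv)
```

In a real script you would call parse_search_args(sys.argv[1:]); passing the list explicitly simply makes the function easy to test.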
However, unlike the previous homeworks, we will be using a legal corpus provided by Intelllex, a company with origins partially from NUS (Disclaimer: my research group may be helping with their work on better legal search technologies, outside of our class homework assignment).
Problem Statement: Given 1) a legal corpus (to be posted in chunks to IVLE) as the candidate document collection to retrieve from, and 2) a set of queries, return the list of the IDs of the relevant documents for each query, sorted in order of relevance. Your search engine should return the entire set of relevant documents (don't threshold to the top K relevant documents).
Your system should return the results for the query in query-file on a single line. Separate the IDs of different documents by a single space ' '. Return an empty line if no cases are relevant.
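The output requirements above can be sketched as follows. This is only one possible reading: the helper names rank_documents and format_results are hypothetical, and treating "relevant" as "score greater than zero" is an assumption you may well want to replace with your own criterion.

```python
def rank_documents(scores):
    """Return every doc ID with a positive score, best first (no top-K cut).

    `scores` maps doc ID -> relevance score; ties are broken by doc ID.
    """
    matched = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    matched.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in matched]


def format_results(ranked_doc_ids):
    """Join doc IDs with single spaces; an empty ranking yields an empty line."""
    return " ".join(str(doc_id) for doc_id in ranked_doc_ids) + "\n"
```

Writing the string returned by format_results to the output file gives exactly one line per query, with no tabs or quotation marks.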
For this assignment, no holds are barred. You may use any type of preprocessing, post-processing, indexing and querying process you wish. You may wish to incorporate or use other python libraries or external resources; however, for python libraries, you'll have to include them with your submission properly -- I will not install new libraries to grade your submissions.
Intelllex, the company we are working with for this contest, is particularly interested in good IR systems for this problem and thus is cooperating with us for this homework assignment. They have provided the corpus (the documents are in the public domain, as is most government information, but the additional structuring the Intelllex team has done is their own work) and relevance judgments for a small number of queries. Teams that do well may be approached by Intelllex to see whether you'd like to work further on your project to help them for pay. Note: Your README will be read by Min and may be read by the Intelllex team, but your code will not be given to their team to use; if they are interested in what you have done, you may opt to license your work to them.
The legal cases and the information needs have a particular structure in this task. Let's start with the information needs.
Queries:
In Intelllex's own system, searchers (lawyers or paralegals) use the familiar search bar to issue a Boolean phrasal query, such as in the training query q1.txt: "intentional tort" AND "remoteness of damage". You'll notice that the training and test queries are simple Boolean AND queries with keywords/keyphrases in quotes (even single words are in quotes), e.g., q2.txt, which has the contents "murder" AND "provocation" AND "loss of self-control". Phrasal queries are 2 or 3 words long at most, so if you want to handle phrasal queries, you can support them using n-word indices or positional indices. There are no ORs, NOTs or parentheses in the queries issued by us, so you can simplify your query parsing code if you choose.
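Because the queries are restricted to quoted terms joined by AND, the parsing step can stay very small. The sketch below (Python 3) is one way to do it; the function name parse_boolean_query is hypothetical, and splitting phrases on whitespace (so that "self-control" stays one token) is a simplifying assumption that should match whatever tokenization your indexer uses.

```python
import re


def parse_boolean_query(query):
    """Split a query such as '"murder" AND "provocation"' into its terms.

    Each term is returned as a list of lowercased words, so a keyword
    becomes a one-word list and a phrase becomes a longer list.
    """
    terms = []
    for part in query.split(" AND "):
        match = re.fullmatch(r'\s*"([^"]+)"\s*', part)
        if match is None:
            raise ValueError("expected a quoted keyword or phrase: %r" % part)
        terms.append(match.group(1).lower().split())
    return terms
```

For q2.txt this yields three terms: two single keywords and one three-word phrase.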
Query Relevance Assessments:
The query is the first line of the query file. The sample training query files also come with (very few) relevance judgments as subsequent lines; these are available to you only in the sample queries (the test queries won't supply you with these answers, for obvious reasons).
Each line marks a positive (relevant) or negative (irrelevant) legal case identified within the corpus. You should ideally have your system rank documents from the positive list before any other documents. As relevance judgments are expensive (lawyers assigned the judgments made available to you), the bulk of the Intelllex corpus was not assessed for relevance. That is, there may be additional relevant documents in the corpus that are not listed. However, you will be assessed only on those that show up in the positive list. We show the example for the above q1.txt.
"intentional tort" AND "remoteness of damage"
+ 3110812
+ 3110916
+ 2750632
- 2154455
The above indicates that there are three documents, with document_ids of 3110812, 3110916 and 2750632, that are relevant to the query; and one document, 2154455, that is known to be irrelevant.
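To use the training judgments when tuning your system, you may find it handy to read a sample query file into its parts. The sketch below assumes the query is on the first line and each judgment is on its own subsequent line, starting with + or - followed by the document ID; the function name parse_query_file is hypothetical.

```python
def parse_query_file(lines):
    """Split query-file lines into (query, relevant_ids, irrelevant_ids)."""
    lines = [line.strip() for line in lines if line.strip()]
    query = lines[0]
    relevant, irrelevant = [], []
    for line in lines[1:]:
        # The first character is the sign; the rest of the line is the ID.
        sign, doc_id = line[0], line[1:].strip()
        if sign == "+":
            relevant.append(doc_id)
        elif sign == "-":
            irrelevant.append(doc_id)
    return query, relevant, irrelevant
```

Test query files contain only the first line, in which case both lists simply come back empty.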
Cases:
As per the guest lecture, legal cases are structured documents. For the purposes of our assignment, we are going to use an XML representation of a relevant case. Below are snippets of a document, ID 3110812, a case from the Singapore jurisdiction, relevant to the above example query:
<str name="document_id">3110812</str>
<str name="title">Ngiam Kong Seng and another v Lim Chiew Hock[2008] 3 SLR(R) 674; [2008] SGCA 23</str>
<str name="url">http://www.singaporelaw.sg/sglaw/laws-of-singapore/case-law/cases-in-articles/negligence/1608-ngiam-kong-seng-and-another-v-lim-chiew-hock-2008-3-slr-r-674-2008-sgca-23</str>
<str name="content">Ngiam Kong Seng and another v Lim Chiew Hock Ngiam Kong Seng and another v Lim Chiew Hock[2008] 3 SLR(R) 674; [2008] SGCA 23 Suit No : Civil Appeal No 38 of 2007 Decision Date : 29 May 2008 Court : Court of Appeal Coram : Chan Sek Keong CJ, Andrew Phang Boon Leong JA and V K Rajah JA Counsel : Cecilia Hendrick and Wee Ai Tin Jayne (Kelvin Chia Partnership) for the appellants; Quentin Loh SC (Rajah & Tann) and Anthony Wee (United Legal Alliance LLC) for the respondent. Tort – Negligence – Duty of care – Psychiatric harm – Applicable test to determine existence of duty of care – Application of two-stage test set out in Spandeck Engineering (S) Pte Ltd v Defence Science & Technology Agency – First stage involving consideration of whether there was sufficient legal proximity with three factors set out by Lord Wilberforce in McLoughlin v O'Brian playing important role – Second stage involving consideration of whether there are any public policy factors militating against the court imposing duty of care – Threshold considerations of recognisable psychiatric illness and factual foreseeability Tort – Negligence – Duty of care – Psychiatric harm – Applicable test to determine existence of duty of care – Whether type of damage claimed should result in different test from two-stage test set out in Spandeck Engineering (S) Pte Ltd v Defence Science & Technology Agency – Application of two-stage test irrespective of type of damage claimed Tort – Negligence – Duty of care – Whether tortfeasor owing duty of care not to cause psychiatric harm – Whether communication of matters relating to accident sufficient to found duty of care 
Facts The first appellant, who was riding a motorcycle, was involved in an accident which was allegedly caused by the respondent, who was driving a taxi. As a result of the accident, the first appellant sustained severe injuries which rendered him a tetraplegic. Both immediately after and during the period following the accident, the respondent represented himself to be a helpful bystander who had rendered assistance to the first appellant. The second appellant was, accordingly, led to believe that the respondent was a good Samaritan and developed feelings of gratitude towards him. The inquiries by the appellants' solicitors eventually led to the second appellant being told that the respondent had been involved in the accident. She subsequently suffered from major depression and suicidal tendencies resulting from, she claimed, having been "betrayed". The appellants eventually started an action in negligence. ... Held, dismissing the appeal: (1) The appeal, as far as the first appellant was concerned, should be dismissed. There was no basis for finding that the trial judge's decision was plainly wrong as, inter alia, there were inconsistencies in the first appellant's evidence; the evidence from the only independent eyewitness (ie, the respondent's passenger at the material time) did not support the first appellant's evidence; there was a lack of objective evidence in support of the first appellant's case: at [14] to [18], [148]. ... [Observation: Whether or not reform of the tort of negligence vis-a-vis psychiatric harm was to be effected was one that was best left to the Legislature. Many issues which had to be grappled with laid wholly outside the expertise of the court and related to policy matters which required the Legislature's consideration: at [120].] 
Case(s) referred to Alcock v Chief Constable of South Yorkshire Police [1992] 1 AC 310 (refd) Anns v Merton London Borough Council [1978] AC 728 (refd) Barnard v Santam Bpk 1999 (1) SA 202 (refd) Bolitho v City and Hackney Health Authority [1998] AC 232 (refd) Bourhill v Young [1943] AC 92 (refd) Brown v The Mount Barker Soldiers' Hospital Incorporated [1934] SASR 128 (refd) Caparo Industries Plc v Dickman [1990] 2 AC 605 (refd) Childs v Desormeaux (2006) 266 DLR (4th) 257 (refd) Cooper v Hobart (2001) 206 DLR (4th) 193 (refd) Corr v IBC Vehicles Ltd [2008] 2 WLR 499 (refd) Council of the Shire of Sutherland, The v Heyman (1984-1985) 157 CLR 424 (refd) Customs and Excise Commissioners v Barclays Bank plc [2007] 1 AC 181 (refd) ... [Editorial note: This was an appeal from the decision of the High Court in [2007] SGHC 38 .] 29 May 2008 Andrew Phang Boon Leong JA (delivering the grounds of decision of the court): Introduction 1 This was an appeal against the decision of the trial judge ("the Judge"), who dismissed the appellants' claims for damages (see Ngiam Kong Seng v CityCab Pte Ltd [2007] SGHC 38 ) ("the GD")). We dismissed the appeal and now give the reasons for our decision. ... The facts ... Conclusion 148 Our consideration of the evidence and the parties' arguments demonstrated that the Judge was not plainly wrong in arriving at her decision that the respondent was not responsible for the first appellant's injuries. In so far as the second appellant was concerned, we found that there was no duty of care owed by the respondent to her. In the circumstances, we dismissed the appeal with costs and the usual consequential orders. 149 That the Accident was a tragic one is undeniable, and we have the utmost sympathy for the very real plight of both the appellants. In this regard, we must commend Mr Quentin Loh SC, who was lead counsel for the respondent. 
At the conclusion of this appeal, he not only expressed his sympathy for the appellants, but also assured the court that, although he had no instructions as to costs, he would consult with his client and recommend that costs not be enforced against the appellants. He added that, if necessary, his own costs would be reduced. Reported by Nathaniel Khng.</str>
<str name="source_type">primary</str>
<str name="content_type">case</str>
<str name="court">SGCA</str>
<date name="date_posted">2008-05-29T00:00:00Z</date>
<arr name="jurisdiction"> <str>SG</str> </arr>
<arr name="tag"> <str>Negligence</str> </arr>
<str name="domain">singaporelaw.sg</str>
<bool name="show">true</bool>
<bool name="hide_url">false</bool>
<bool name="hide_blurb">false</bool>
<bool name="modified">true</bool>
<date name="date_modified">2017-01-26T00:00:00Z</date>
<long name="_version_">1560853972779008000</long>
You will notice that there are a lot of fields in the case. Not all fields are relevant to assessing a case's relevance to the query (and thus you may not want to index them); they are included for the sake of completeness. Importantly, the content field has much structure itself, consisting of subheaders, numbered statements and substatements. You may decide to exploit this structure through preprocessing in your indexing if you think you can capitalize on it. Note that formatting may differ between jurisdictions, or even between courts within the same jurisdiction.
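If you want to pull the fields out of a case file before indexing, a sketch along the following lines may help. It uses the standard xml.etree.ElementTree module and assumes that each file is well-formed XML with a single root element wrapping the named fields (the root name doc in the usage note is hypothetical, so check the actual corpus files). The function name parse_case and the choice of fields to skip are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Bookkeeping fields that look unlikely to help ranking (an assumption;
# adjust this set after inspecting the corpus yourself).
SKIPPED_FIELDS = {"show", "hide_url", "hide_blurb", "modified",
                  "date_modified", "_version_"}


def parse_case(xml_text, skipped=SKIPPED_FIELDS):
    """Return a dict mapping each field's name attribute to its text.

    <arr> fields become lists holding the text of their child elements.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        name = child.get("name")
        if name is None or name in skipped:
            continue
        if child.tag == "arr":
            fields[name] = [(item.text or "").strip() for item in child]
        else:
            fields[name] = (child.text or "").strip()
    return fields
```

For example, parsing a string such as '<doc><str name="court">SGCA</str></doc>' gives a dictionary with the single key court.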
Please note that the corpus distributed consists of three parts (intelllex-sg.zip, intellex-au.zip and intellex-etc.zip), due to different release times. Please dump all of these files into a single directory (e.g., intellex) to use with the -i command line switch. You should have a total of 29,869 documents after consolidation (ignoring duplicate documents).
As introduced in Week 8, zones are free-text areas, usually within a document, that hold some special significance. Fields are more akin to database columns (in a database, we would actually make them columns), in that they take on a specific value from some (possibly infinite) enumerated set of values.
Along with the standard notion of a document as an ordered set of words, handling zones, fields, or both is important for certain aspects of case retrieval.
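One simple way to make use of zones is weighted zone scoring, where each zone contributes a fixed weight to a document's score if the query matches within it. The sketch below is a minimal Boolean version of that idea; the function name weighted_zone_score, the zone names and the weights are all hypothetical, and you would normally tune the weights on the training queries.

```python
def weighted_zone_score(query_terms, zones, weights):
    """Sum the weights of the zones that contain every query term.

    `zones` maps zone name -> text; `weights` maps zone name -> weight.
    """
    score = 0.0
    for zone_name, weight in weights.items():
        tokens = set(zones.get(zone_name, "").lower().split())
        if all(term.lower() in tokens for term in query_terms):
            score += weight
    return score
```

With weights that sum to 1, a document matching in every zone scores 1 and a document matching nowhere scores 0. Fields such as the jurisdiction can be handled separately, for instance as an exact-match filter or boost.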
You might notice that many of the terms used in the text of the legal cases themselves do not overlap with the query terms used. This is known as the anomalous state of knowledge (ASK) problem or vocabulary mismatch, in which the searcher may use terminology that doesn't fit the documents' expression of the same semantics. A simple way that you can deal with the problem is to utilize query expansion.
In this technique, we use a first round of retrieval on the query terms used by a searcher to find some sample documents. Assuming that these documents are relevant, we can extract some terms found in these documents, or use the entire documents themselves, as queries in a second round of retrieval. The idea is that the sample documents have terminology that matches the document corpus, overcoming the problem of vocabulary mismatch.
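The two-round idea can be sketched as below (this is often called pseudo-relevance feedback). The sketch only covers the expansion step: it takes the text of the first-round documents and adds their most frequent new terms to the query. The function name expand_query, the default of five extra terms and the use of raw term frequency are assumptions made for illustration; weighting candidates by tf×idf is a common alternative.

```python
from collections import Counter


def expand_query(query_terms, top_documents, extra_terms=5, stopwords=frozenset()):
    """Append the most frequent new terms from the first-round documents."""
    counts = Counter()
    for text in top_documents:
        counts.update(text.lower().split())
    original = {term.lower() for term in query_terms}
    candidates = [
        term for term, _ in counts.most_common()
        if term not in original and term not in stopwords
    ]
    return list(query_terms) + candidates[:extra_terms]
```

The expanded term list is then run as the second-round query. Passing a stopword set keeps very common words from crowding out useful legal terminology.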
Submission files: README.txt, index.py, search.py, dictionary.txt, and postings.txt. Please do not include the legal case corpus.
You are allowed to do this assignment individually or as a team of up to 4 students. There will be no difference in grading criteria if you do the assignment as a large team or individually. For the submission information below, simply replace any mention of a matric number with the matric numbers concatenated with a separating dash (e.g., A000000X-A000001Y-A000002Z). Please ensure you use the same identifier (matric numbers in the same order) in all places that require a matric number.
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
README.txt: this is a text-only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your matric number and email (we need the email to contact you about your grade; please use your A*******@nus.edu.sg address, not your email alias). This is to help you get an objective grade in your assignment, as we won't associate matric numbers with student names. You should use the README.txt template given to you in Homework #1 as a start. In particular, you need to assert whether you followed class policy for the assignment or not.
These files will need to be suitably zipped in a single file called <matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called <matric number>. Upload the resulting zip file to the IVLE workbin by the due date: 14 April 2017, 11:59:59 pm SGT. There will absolutely be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.
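If you would like to script the packaging step, the following sketch builds an archive whose entries all sit inside a directory named after your matric number. It uses the standard zipfile module; the function name build_submission is hypothetical, and the file list is an assumption based on the files named in this assignment, so add any libraries you need to bundle.

```python
import os
import zipfile

SUBMISSION_FILES = ["README.txt", "index.py", "search.py",
                    "dictionary.txt", "postings.txt"]


def build_submission(matric, source_dir=".", files=SUBMISSION_FILES):
    """Create <matric>.zip with every file stored under a <matric>/ folder."""
    archive_path = os.path.join(source_dir, matric + ".zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in files:
            archive.write(os.path.join(source_dir, name),
                          arcname=matric + "/" + name)
    return archive_path
```

Unzipping the result should produce a single directory named after the matric number; it is worth unzipping once yourself to confirm before uploading.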
The grading criteria for the assignment are below. You should note that there are no essay questions for this assignment.
temp-*.* in the current directory for query processing.
-i) are correctly interpreted (add trailing slash if needed). Check that your output is in the correct format (docIDs separated by single spaces, no quotation marks, no tabs).
<show>, <hide_url>, <hide_blurb>, <modified>, <date_modified>, and <_version_> are not relevant to the search task. You may want to omit this information in your indexing. He had also indicated that the <areaoflaw> tags (when available) can be quite helpful. Also, as the queries were made and assessed by a lawyer working in Singapore, you may find documents from the SG jurisdiction more relevant.
<content> string indexed as a normal cosine, tf×idf field; 2) Support phrasal search queries, by adding a positional index or an n-word index; 3) Find a method to cut the document into different zones, which may further help you weight different parts of the documents in matching your search. There are plenty of other things you can try, so spend some time with the data and the relevant and irrelevant documents to decide what seems to work well.
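For the phrasal search suggestion above, a positional index is one option. The sketch below keeps everything in memory and tokenizes on whitespace, both of which are simplifying assumptions (your real index will live in the dictionary and postings files); the function names build_positional_index and phrase_matches are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_matches(index, phrase):
    """Return sorted doc IDs in which the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    # Only documents containing every word can contain the phrase.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    hits = []
    for doc_id in candidates:
        later = [set(index[word][doc_id]) for word in words[1:]]
        # The phrase starts at `start` if word k appears at start + k.
        if any(all(start + offset + 1 in positions
                   for offset, positions in enumerate(later))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

An AND query is then the intersection of the phrase_matches results for each quoted term. An n-word index trades this position checking for a larger dictionary.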