In our final Homework 4, we will hold an information retrieval contest with real-world documents and queries: the problem of patent retrieval. As described in lecture, patent retrieval is a case where recall is particularly important, as it is important to not miss any relevant documents (a requirement common to search engines working in the area of law).
The indexing and query commands will use an (almost) identical input format to Homeworks #2 and #3, so that you need not modify any of your code to deal with command line processing. To recap:
Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file
Searching: $ python search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results
The difference from Homeworks #2 and #3 is that the query-file specifies a single query, and not a list of queries.
However, significantly different from the previous homework, we will be using a patent corpus, provided by PatSnap, a company with origins partially from NUS (Disclaimer: I have no interest or affiliation with PatSnap, although one alumnus from my group is working there).
Problem Statement: Given 1) a patent corpus (posted to IVLE) as the document collection to retrieve from, and 2) a set of free text information needs, return the list of the IDs of the relevant documents for each need, sorted in order of relevance. Your search engine should return the entire set of relevant documents (don't threshold to the top K relevant documents; as described, recall is important in patent search).
Your system should return the results for the query in query-file on a single line. Separate the IDs of different documents by a single space ' '. Return an empty line if no patents are relevant.
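The output format described above can be sketched as follows. This is a minimal illustration, not a required implementation: the function names `format_results` and `write_results` are hypothetical, and ending the line with a newline is an assumption.

```python
def format_results(doc_ids):
    """Join ranked document IDs into the single-line output format."""
    # IDs are separated by exactly one space; an empty ranking gives an
    # empty line. The trailing newline is an assumption of this sketch.
    return " ".join(doc_ids) + "\n"


def write_results(doc_ids, output_path):
    """Write the ranked IDs to the output file given with -o."""
    with open(output_path, "w") as out:
        out.write(format_results(doc_ids))
```

For example, `format_results(["EP2067524A1", "US20090201761A1"])` gives one line with the two IDs separated by a single space, and `format_results([])` gives an empty line.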
For this assignment, no holds are barred. You may use any type of preprocessing, post-processing, indexing and querying process you wish. You may wish to incorporate or use other python libraries or external resources; however, for python libraries, you'll have to include them with your submission properly -- I will not install new libraries to grade your submissions.
PatSnap, the company we are working with for this contest, is particularly interested in good IR systems for this problem and thus is cooperating with us for this homework assignment. They have provided the corpus (the patents are in the public domain, as is most government information) and relevance judgments for a small number of queries. Teams that do well may be approached by PatSnap to see whether you'd like to work further on your project to help them for pay. Note: Your README will be read by both Min and the PatSnap team, but your code will not be given to their team to use; if they are interested in what you have done, you may opt to license your work to them.
The patents and the information needs have a particular structure in this task. Let's start with the information needs.
Information Need: We call the inputs information needs, as they describe the relevant documents at a semantic level, and not (necessarily) at the shallow, language level at which the queries given to the search engine will have to execute. The needs will be given in a format similar to TREC queries. They will have a title field, which is a short noun phrase or sentence describing the information need. A description field will give more detail on what the relevant documents may or may not contain (it will always start with "relevant documents will describe"). Here is an example information need (also provided in the workbin):
<?xml version="1.0" ?> <title> Washers that clean laundry with bubbles </title> <description> Relevant documents will describe washing technologies that clean or induce using bubbles, foam, by means of vacuuming, swirling, inducing flow or other mechanisms. </description>
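One possible way to read such a need is sketched below. It is an illustration under stated assumptions: the example as printed has no single enclosing element, so a strict XML parser might reject it, and this sketch uses regular expressions instead. The function name `parse_need` is hypothetical, the description in the usage is abbreviated, and XML entities are not unescaped.

```python
import re


def parse_need(xml_text):
    """Extract the title and description text from an information need."""
    fields = {}
    for tag in ("title", "description"):
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), xml_text, re.DOTALL)
        # Collapse runs of whitespace; a missing field becomes an empty string.
        fields[tag] = " ".join(match.group(1).split()) if match else ""
    return fields


need = parse_need(
    '<?xml version="1.0" ?> <title> Washers that clean laundry with bubbles </title> '
    "<description> Relevant documents will describe washing technologies. </description>"
)
print(need["title"])
```

The title and description can then be weighted differently when building the query, since the title is short and the description carries more detail.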
In PatSnap's own system, searchers need to transform these needs into actual search queries. For the above need, a patent engineer transformed it into the following Boolean query:
((bubble AND fine) OR microbubble)
This requires some human knowledge from the patent engineer, as "fine" and "microbubble" don't appear anywhere in the description or title of the query. This is shown for illustrative purposes only; please don't interpret it as an actual step you'll need to do for your assignment. Note that this transformation 1) was done to deal with the Boolean nature of their search engine, and 2) may not reflect the best method to transform the need into a query.
Query Relevance Assessments:
The queries also come with two files, marked as +ve and -ve. These are lists of positive (relevant) and negative (irrelevant) patent documents within the PatSnap corpus. Ideally, your system should retrieve only documents from the positive list. As patent judgments are expensive (we had a patent expert assign the judgments made available to you), the bulk of the PatSnap corpus was not assessed for relevance. That is, there may be additional relevant documents in the corpus that are not listed in the positive file. However, you will be assessed only on those that show up in the positive and negative lists.
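To check a system against these judgment files during development, one could score only the judged documents. The sketch below is an assumption about how such a self-check might look, not the official grading metric (which is not stated here); the function name and the document IDs in the usage are hypothetical.

```python
def judged_precision_recall(ranked_ids, positive_ids, negative_ids):
    """Compute precision and recall over judged documents only."""
    positives = set(positive_ids)
    judged = positives | set(negative_ids)
    # Unjudged documents are skipped: they count neither for nor against.
    retrieved_judged = [d for d in ranked_ids if d in judged]
    hits = sum(1 for d in retrieved_judged if d in positives)
    precision = float(hits) / len(retrieved_judged) if retrieved_judged else 0.0
    recall = float(hits) / len(positives) if positives else 0.0
    return precision, recall


# "X" is unjudged, "B" is a judged negative, "D" is a relevant document missed.
p, r = judged_precision_recall(["A", "X", "B", "C"], ["A", "C", "D"], ["B"])
```

Because recall matters most in patent search, a low recall here is the signal to widen the query rather than tighten it.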
Patents:
Patents are structured documents. For the purposes of our assignment, we are going to use an XML representation of a patent. Below is a document, ID EP2067524A1, which is relevant to the above query:
<?xml version="1.0" ?> <doc> <doc> <str name="Patent Number">EP2067524A1</str> <str name="Application Number">EP2007828700</str> <str name="Kind Code">A1</str> <str name="Title">SWIRLING FLOW PRODUCING APPARATUS, METHOD OF PRODUCING SWIRLING FLOW, VAPOR PHASE GENERATING APPARATUS, MICROBUBBLE GENERATING APPARATUS, FLUID MIXER AND FLUID INJECTION NOZZLE</str> <str name="Abstract"> There are provided a fluid injection nozzle, a fluid mixer, a microbubble generating apparatus, a vapor phase generating apparatus, a method of producing swirling flow, and a swirling flow producing apparatus that can be applied to any kind of fluid and can efficiently generate a swirling flow at high speed. The swirling flow producing apparatus includes a housing and a cylindrical member. The housing includes a cylindrical portion of which at least one end is opened, and a fluid introducing passage that is opened on an inner peripheral surface of the cylindrical portion. The cylindrical member is provided in the cylindrical portion of the housing. The cylindrical member includes a cylindrical portion of which at least one end in a direction corresponding to an opening direction of the cylindrical portion is opened, and holes formed in a peripheral wall of the cylindrical portion. A fluid introduced from the fluid introducing passage flows into the cylindrical portion of the cylindrical member through the holes so as to generate a swirling flow, and flows out from the housing and the cylindrical member. 
</str> <str name="Document Types">EP | EPA | DOCDB</str> <str name="Application Date">2007-09-28</str> <str name="Application Year">2007</str> <str name="Application(Year/Month)">2007-09</str> <str name="Publication Date">2009-06-10</str> <str name="Publication Year">2009</str> <str name="Publication(Year/Month)">2009-06</str> <str name="All IPC">B05B1/34 | B01F5/00</str> <str name="IPC Primary">B05B1/34</str> <str name="IPC Section">B</str> <str name="IPC Class">B05</str> <str name="IPC Subclass">B05B</str> <str name="IPC Group">B05B1</str> <str name="Family Members">KR1020090028835A | WO2008038763A1 | CN101505859A | US20090201761A1 | EP2067524A1</str> <str name="Family Member Count">5</str> <str name="Family Members Cited By Count">1</str> <str name="Other References">See references of WO 2008038763A1</str> <str name="Other References Count">1</str> <str name="Cited By Count">0</str> <str name="Priority Country">JP</str> <str name="Priority Number">2006264652</str> <str name="Priority Date">2006-09-28</str> <str name="Assignee(s)">NAKATA COATING CO., LTD.</str> <str name="1st Assignee">NAKATA COATING CO., LTD.</str> <str name="Number of Assignees">1</str> <str name="1st Assignee Address">82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa 240-0041, JP</str> <str name="Assignee(s) Address">82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa 240-0041, JP</str> <str name="Inventor(s)">MATSUNO, TAKEMI | NAKATA, AKIO</str> <str name="1st Inventor">MATSUNO, TAKEMI</str> <str name="Number of Inventors">2</str> <str name="1st Inventor Address">NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP</str> <str name="Inventor(s) Address">NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP | NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP</str> <str name="Agent/Attorney">HOFFMANN, ECKART</str> <str 
name="cited by within 3 years">0</str> <str name="cited by within 5 years">0</str> </doc>
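A patent in this representation can be loaded into a field dictionary roughly as follows. This is a sketch assuming each file is well-formed XML whose fields are `<str name="...">` elements, as in the example; `parse_patent` is a hypothetical name and the sample string is a shortened version of the document above.

```python
import xml.etree.ElementTree as ET


def parse_patent(xml_text):
    """Map each <str name="..."> element to its stripped text."""
    root = ET.fromstring(xml_text)
    fields = {}
    # iter() visits all descendants, so an extra wrapper element is tolerated.
    for node in root.iter("str"):
        fields[node.get("name")] = (node.text or "").strip()
    return fields


sample = (
    "<doc>"
    '<str name="Patent Number">EP2067524A1</str>'
    '<str name="IPC Primary">B05B1/34</str>'
    '<str name="All IPC">B05B1/34 | B01F5/00</str>'
    "</doc>"
)
patent = parse_patent(sample)
print(patent["Patent Number"])
```

Keeping the fields in a dictionary makes it easy to choose which ones to index and which to ignore.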
You will notice that there are a lot of fields in the patent. Not all fields are relevant to assessing a patent's relevance to the query (and thus you may not want to index them); they are included for the sake of completeness.
In particular, the IPC (International Patent Classification) is a useful piece of data that you may want to use to assess the relevance of a document. It is a hierarchical classification of the patent into an ontology. However, you may need to parse this information in some way to make it useful to your system.
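One way to parse the IPC data is to expand each code into its hierarchy levels, mirroring the IPC Section, Class, Subclass and Group fields of the example document. The sketch below assumes codes shaped like `B05B1/34` and a `|` separator as in the "All IPC" field; the function names and the regular expression are illustrative, not part of any official tool.

```python
import re

# Section letter, two-digit class, subclass letter, group number, subgroup number.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d+)/(\d+)$")


def ipc_levels(code):
    """Return the hierarchy prefixes of one IPC code, coarsest first."""
    code = code.strip()
    match = IPC_PATTERN.match(code)
    if match is None:
        return []  # unrecognised shapes are skipped rather than guessed at
    section, cls, subclass, group, _subgroup = match.groups()
    return [
        section,
        section + cls,
        section + cls + subclass,
        section + cls + subclass + group,
        code,
    ]


def all_ipc_levels(all_ipc_field):
    """Collect the hierarchy levels of every code in an 'All IPC' field."""
    levels = set()
    for code in all_ipc_field.split("|"):
        levels.update(ipc_levels(code))
    return levels
```

Two patents that share a subclass such as `B05B` can then be treated as more related than two that share only the section `B`.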
As introduced in Week 8, zones are free text areas, usually within a document, that hold some special significance. Fields are more akin to database columns (in a database, we would actually make them columns), in that they take on a specific value from some (possibly infinite) enumerated set of values.
Along with the standard notion of a document as an ordered set of words, handling zones, fields, or both is important for certain aspects of patent retrieval.
You might notice that many of the terms used in the patents themselves do not overlap with the query terms used. This is known as the anomalous state of knowledge (ASK) problem or vocabulary mismatch, in which the searcher may use terminology that doesn't fit the documents' expression of the same semantics. A simple way to deal with the problem is to use query expansion.
In this technique, we use a first round of retrieval on the query terms used by a searcher to find some sample documents. Assuming that these documents are relevant, we can extract some terms found in these documents, or use the entire documents themselves as queries, in a second round of retrieval. The idea is that the sample documents have terminology that matches the document corpus, overcoming the problem of vocabulary mismatch.
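The two-round idea can be sketched on a toy in-memory collection. Everything here is an assumption for illustration: the term-count scoring, the function names, the document IDs and their texts are hypothetical, and a real system would use its own index and weighting (and probably stop-word filtering).

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query_terms, docs):
    """Rank document IDs by total occurrences of the query terms."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in set(query_terms))
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def expand_query(query_terms, docs, top_docs=1, extra_terms=3):
    """Add frequent terms from the top first-round documents to the query."""
    feedback = Counter()
    for doc_id in rank(query_terms, docs)[:top_docs]:
        feedback.update(tokenize(docs[doc_id]))
    for term in query_terms:
        feedback.pop(term, None)  # keep only terms that are new to the query
    ordered = sorted(feedback.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ordered[:extra_terms]]


docs = {
    "D1": "bubble washer uses microbubble foam microbubble",
    "D2": "microbubble generating apparatus",
    "D3": "fluid injection nozzle",
}
query = ["bubble", "washer"]
first_round = rank(query, docs)
expanded = expand_query(query, docs, top_docs=1, extra_terms=1)
second_round = rank(expanded, docs)
```

In this toy run the first round finds only D1, while the expanded query also reaches D2, which shares no term with the original query. The assumption that the first-round documents are relevant is what makes this risky when the first round is poor.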
index.py, search.py, dictionary.txt, and postings.txt. Please do not include the patent corpus.
You are allowed to do this assignment individually or as a team of up to 4 students. There will be no difference in grading criteria if you do the assignment as a large team or individually. For the submission information below, simply replace any mention of a matric number with the matric numbers concatenated with a separating dash (e.g., A000000X-A000001Y-A000002Z). Please ensure you use the same identifier (matric numbers in the same order) in all places that require a matric number.
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
README.txt: this is a text only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your matric number and email (we need the email to contact you about your grade; please use your A*******@nus.edu.sg address, not your email alias). This is to help you get an objective grade in your assignment, as we won't associate matric numbers with student names. You should use the README.txt template given to you in Homework #1 as a start. In particular, you need to assert whether you followed class policy for the assignment or not.
These files will need to be suitably zipped in a single file called <matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called <matric number>. Upload the resulting zip file to the IVLE workbin by the due date: 15 April 2016 (updated from 8 April 2016), 11:59:59 pm SGT. There will absolutely be no further extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.
The grading criteria for the assignment are below. You should note that there are no essay questions for this assignment.
temp-*.* in the current directory for query processing.
-i) are correctly interpreted (add trailing slash if needed). Check that your output is in the correct format (docIDs separated by single spaces, no quotations, no tabs).