Quick Links: [ Home ] [ IVLE ] [ Project Info ] [ Schedule ] [ Details ] [ HW 1 ] [ HW 2 ]
As per our lecture materials on authorship detection, you will be creating machine learning features to be used with a standard machine learner to try to guess the identity of the author. We will be using book reviews from Amazon.com to try to compute attribution.
You are to come up with as many features as you can think of for computing the authorship of the files. "Classic" text categorization usually works to compute the subject category of the work. Here, because many of the book reviewers examine similar subjects and write reviews for a wide range of books, standard techniques will not fare as well. You should use features that you come up with, as well as any additional features you can think of. You can code any feature you like, but you will be limited to 30 real-valued features in total. You do not have to use all 30 features if you do not wish to.
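As an illustration of what "real-valued features" might look like, here is a minimal sketch of a feature extractor. The particular features (word count, mean word length, mean sentence length, comma rate, type-token ratio) are only examples chosen for this sketch, not a required or recommended set, and the function name `extract_features` is hypothetical.

```python
import re

MAX_FEATURES = 30  # the limit stated in the assignment


def extract_features(review: str) -> list[float]:
    """Return a short list of illustrative stylometric features for one review."""
    words = re.findall(r"[A-Za-z']+", review)
    # Rough sentence split on runs of ., ! or ?; empty pieces are discarded.
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    n_words = len(words)
    n_chars = len(review)

    features = [
        # Review length in words.
        float(n_words),
        # Mean word length in characters.
        sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Mean sentence length in words.
        n_words / len(sentences) if sentences else 0.0,
        # Commas per character.
        review.count(",") / n_chars if n_chars else 0.0,
        # Type-token ratio (distinct lowercased words / total words).
        len({w.lower() for w in words}) / n_words if n_words else 0.0,
    ]
    assert len(features) <= MAX_FEATURES
    return features
```

Each feature guards against empty input, so an empty review yields a vector of zeros rather than a division error.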
In the workbin, you will find a training file containing reviews of books and other materials from 2 of Amazon's top customer reviewers (for their .com website; Amazon keeps different reviews in different countries -- e.g., UK). It is organized as 1 review per line, so the lines are very long. Each line gives the review followed by a tab (\t) and then the author ID (either +1 or -1). There are 100 examples per author in the training section.
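The line format described above (review text, a tab, then +1 or -1) could be read along the following lines. This is a sketch only: the function names are hypothetical, the path is whatever the training file is called in the workbin, and the UTF-8 encoding with replacement of bad bytes is an assumption about the data.

```python
def parse_training_line(line: str) -> tuple[str, int]:
    """Split one training line into (review text, author ID)."""
    # Split on the LAST tab so that any tabs inside the review text survive.
    review, _, label = line.rstrip("\r\n").rpartition("\t")
    return review, int(label)  # int() accepts both "+1" and "-1"


def load_training(path: str) -> list[tuple[str, int]]:
    """Read every non-blank line of the training file at `path`."""
    # Encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as f:
        return [parse_training_line(line) for line in f if line.strip()]
```

Splitting from the right keeps the parser robust if a review happens to contain a tab character of its own.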
In the test.txt file you will find a list of reviews, again, one per line, but without the author ID given.
We are going to use the SVM light package authored by Thorsten Joachims as the machine learning framework. You should familiarize yourself with how to apply SVM to your dataset. SVM has a very simple format for vectors used to train and test its hypothesis. I should have demonstrated this during class. Be aware that training an SVM can be quite time-intensive, even on a small training set.
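For reference, SVM light reads one example per line in the form `<target> <feature>:<value> <feature>:<value> ...`, with feature numbers starting at 1 and listed in increasing order. A minimal sketch of producing such lines follows; the function name `to_svmlight_line` is hypothetical, and writing `0` as the target for test reviews whose author is unknown is an assumption of this sketch, not something the assignment specifies.

```python
def to_svmlight_line(features, label=None):
    """Format one feature vector as a line in SVM light's input format."""
    # Training examples carry +1 or -1; a 0 target is used here as a
    # placeholder for unlabeled test examples (an assumption of this sketch).
    target = "0" if label is None else f"{label:+d}"
    # Feature numbers start at 1 and must appear in increasing order.
    pairs = " ".join(f"{i}:{v:.6g}" for i, v in enumerate(features, start=1))
    return f"{target} {pairs}"
```

For example, `to_svmlight_line([3.0, 0.5], 1)` gives `+1 1:3 2:0.5`.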
You will upload an X.zip (where X is your matric ID) file by the due date, consisting of the following four files:
svm_train with defaults (i.e., no arguments) to the machine learner.
Updated on Wed Oct 15 15:25:12 GMT-8 2003. Please use a ZIP (not RAR or TAR) utility to construct your submission. Do not include a directory in the submission to extract to (e.g., unzipping X.zip should give files X.sum, not X/X.sum or submission/X.sum). Please use all capital letters when writing your matric number (matric numbers should start with HT or HD for all students in this class). Your cooperation with the submission format will allow me to grade the assignment in a timely manner.
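One way to satisfy the packaging rules above (ZIP format, no enclosing directory, capitalized matric number) is sketched below using Python's standard `zipfile` module. The function name `build_submission` and its parameters are hypothetical, and the list of files to include is left to the caller.

```python
import os
import zipfile


def build_submission(matric_id: str, file_paths: list[str], out_dir: str = ".") -> str:
    """Create <MATRIC>.zip containing the given files with no directory prefix."""
    matric = matric_id.upper()  # the assignment asks for all capital letters
    zip_path = os.path.join(out_dir, f"{matric}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            # arcname drops any directory part, so unzipping yields X.sum,
            # not X/X.sum or submission/X.sum.
            zf.write(path, arcname=os.path.basename(path))
    return zip_path
```

Listing the archive contents (for instance with `zipfile.ZipFile(path).namelist()`) before uploading is a quick check that no entry contains a directory component.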
Your grade will be 75% determined by performance as judged by accuracy, and 25% determined by the summary file you turn in and the smoothness of the evaluation of your submission. For example, if your files cause problems with the machine learner (incorrect format, etc.), this will result in penalties within that 25%.
Of the remaining 75%, 45% of the grade will be determined by your training set performance, and 30% determined by the testing performance. For both halves, the grade given will be largely determined by how your system performs against your peer systems. I will also have a reference baseline system to compare your systems against; this will constitute the remaining portion of the grade.
Please note that it is quite trivial to get a working assignment that will get you most of the grade. Even a single-feature classifier using review length in words as its feature would be awarded at least 40 marks (the 25% plus 15% for worse-than-baseline performance). I recommend that you make an effort to complete a baseline working system within the first week of the assignment.
There is no late policy for this homework. Submissions should be submitted by the due date, by 11:59:59 pm. Any lateness will result in 0 marks being awarded for the whole assignment.