In this homework assignment, you will be implementing logistic
regression using batchstochastic gradient descent
for binary classification. After implementing the programme, you
will need to carry out a small set of experiments based on your
working code. Complete the code in the template
hw1-1.ipynb
file provided. Make sure you document your
code liberally using inline python comments (and Markdown cells where
appropriate).
Update: Implement a
fixed learning rate algorithm for logistic regression (LFD,
page 97). Use the "batch" formulation of
gradient descent (except where indicated), instead of the more
efficient stochastic gradient descent (this is to assist with
automated grading). Initialize the weight with a zero vector
0, a (column) vector of all 0's, similar to Tutorial 1. Name
your defined function lr
for doing one iteration of
logistic regression. It should work similar to the pla
function in Tutorial 1, where the original weights are input and
updated weights output by running one iteration.
Use the cross-entropy error measure for minimization (this is "equivalent" to maximum likelihood estimation); pointwise, this is en(w) = ln (1 + e-ynwTxn).
Note that your code will also need to calculate
Ein (for gradient calculation) and
Eout (for evaluation on the testing set); so
creating a function that can calculate the actual error rate is
needed. In this problem, although we'll run it as regression problem,
the outputs are real valued at +1 and -1 and is deemed as
classification (think about the issues described in lecture with
respect to using regression for classification). Thus, the
out-of-sample error Eout is given by 0/1 loss. Your
lr
function also needs to take a learning rate variable,
η (eta) as a parameter, which controls the step size.
After implementation, run the following experiments in different
cells. In particular, pay attention to part (c) which may require you
to rework part of your lr
function to take a
parameter that modifies its instance selection.
Update: For a small amount of optional bonus marks (1 or 2; the assignment is worth 100 marks), you can also complete work to help show the visualization of your work on the training data. Note: This is more work that may not be commensurate to the amount of effort you need to do to accomplish the task. The below optional exercises only need to be done for part (a).
You'll find these files in the sample submission .zip.
hw1-train.txt:
the input file for you to train your
logistic regression classifier model, which contains 1000 samples of
20-dimensional data, with the final class 1
or
-1
as an output, all separated by spaces.
hw1-test.txt:
a file of test data, containing an
additional 3000 samples of the same data; again with the final class
1
or -1
as an output, all separated by
spaces.
You are also asked to answer the following essay questions. These
are to test your understanding of the lecture materials. Note that
these are open-ended questions and do not have gold standard answers.
Note that you may need to write math equations or notation in the
files, so please familiarize yourself with this. You can either use a
word processor or do a dump (i.e., print to .pdf) of another program
that can display math (i.e., iPython notebook, R). Submit it as part
of your .zip submission, exactly named as essay1.pdf
.
It can be useful to repeat the question in your essay submission, for readability purposes, but it's not mandatory.
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. We have over 100 students in the class, and assuming it takes 10 minutes to grade each submission -- that would be over 1.3 days of uninterrupted grading with no sleep to clear. Following the submission guideliness will help us grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and all letters should be in CAPITALS (inclusive of the ending check letter). You are to turn in the following files:
Update: Your
source code file hw1-1.ipynb
for this homework
assignment (the "-1" suffix for the first programming problem, we
only give one problem here). We will be reading your code, so please do us
a favor and format it nicely. To expedite
grading, we need to require all of the assignments to use Python 3.4
(3.5 should ok as well).
In your source code .ipynb
, describe any information
you want us to know about your submission, inclusive of your
suitably filled out independent work statement. You should not include any identifiable
information about your assignment (your name, phone number, etc.)
except your matric number and email (we need the email to contact
you about your grade, please use your [a|e]*******@nus.edu.sg
address, not your email alias). This is to help you get an objective
grade in your assignment, as we won't associate matric numbers with
student names. You can include this information at the top of your
source code in Markdown cell(s) in the (just the first
-1.ipynb
, when you have multiple problems)
.ipynb
notebook source file.
At the end of your (first) notebook source file, you need to put your Statement of Independent Work. Use a Markdown cell and follow the instructions given in tutorial about how to fill this section out. Please be honest and if you do not follow the course policy, suggest a plausible alternative evaluation for us to consider.
Also put pointers or text to any (online or offline) references you have utilised in making your submission in an appropriate References cell.
A .PDF file essay1.pdf
that contains your answers to
the essay questions (no word processing files, please export to PDF
with any unusual fonts embedded within the document if needed.)
Again, do not disclose any identifiable information in your essay
answers.
Any ancillary files that your submission uses. We may or may
not read them, so please put the core part of your programming
answers in the hw1-<x>.ipynb
file.
You do not need to submit the sample files that are provided for you in the sample .zip file, these are just for your reference.
These files will need to be suitably zipped in a single file called
<matric number>.zip
. Please use a zip archive and
not tar.gz, bzip, rar or cab files. Make sure when the archive unzips
that all of the necessary files are found in a directory called
<matric number>
(note to Mac users please test your
zip archives on the command line; as OSX "automagically" creates the
parent folder but it is often actually not contained in the
archive). Upload the resulting zip file to the IVLE workbin by the due
date: 5 Sep 2016 13
Sep 2016, 11:59:59 pm SGT. There absolutely will be no
extensions to the deadline of this assignment. Read the late policy if
you're not sure about grade penalties for
lateness.
New: It's suggested
that you do this homework assignment in stages, and not attempting to
tackle the whole thing in one step. In particular, those new to
Python should try to first import the data into your program. You can
write the code to import the data yourself, or use a software library,
such as pandas to help you import. You may find some online
utilities to do logistic regression (or desktop programs such as
Microsoft Excel), which you can run with a smaller subset of the data
(both in terms of the input dimensions as well as number of
instances), to do some sanity checking against your own
lr
routine.
The grading criteria for the assignment is tentatively: