In this homework, you will be implementing a language detection module using the ngram knowledge you learned in Week 1. Given a string representing some natural language utterance, your program should be able to predict whether the text is Indonesian, Malaysian or (phonetically transcribed into English) Tamil. So given the following three strings:
Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
Maitap piiviyiar cakalarum cutantiramkav piakkiaar
... an ideal program should output/predict the following labels for the strings:
malaysian | Semua manusia dilahirkan bebas ... |
indonesian | Semua orang dilahirkan merdeka ... |
tamil | Maitap piiviyiar cakalarum cutantiramkav piakkiaar ... |
You should run your program to build and test your LMs in this format:
$ python build_test_LM.py -b input-file-for-building-LM -t input-file-for-testing-LM -o output-file
where input-file-for-building-LM
is a file given to
you that contains a list of strings with their labels for you to build
your ngram language models, input-file-for-testing-LM
is
a file containing a list of strings for you to test your language
models, and output-file
is a file where you store your
predictions.
To evaluate the accuracy of your predictions, you run the evaluation
file eval.py
which is given to you:
$ python eval.py file-containing-your-results file-containing-correct-results
where file-containing-your-results
is the output-file
from the build and test step, and
file-containing-correct-results
is a file containing the
correct string labels.
For example, in the homework package, you are given several files,
including a skeleton build_test_LM.py
, eval.py
, input.train.txt
, input.test.txt
, and input.correct.txt
. To build and test your LMs, run:
$ python build_test_LM.py -b input.train.txt -t input.test.txt -o input.predict.txt
which will store your predictions in input.predict.txt
. To evaluate your
predictions, run:
$ python eval.py input.predict.txt input.correct.txt
which prints the accuracy of your predictions.
build_test_LM.py
?The python program build_test_LM.py
is given to you as a skeleton
script. You are required to complete this script by implementing the
build_LM()
and test_LM()
functions.
You need collect the 4-grams from the string where the gram units are characters. For example, for the string Semua manusia dilahirkan bebas ..., the below character 4-grams would be collected (Note that you can choose to pad the beginning and end of the email as shown in slide 13 of lecture 1; it's up to you, but you'll want to document your choice).
[('S', 'e', 'm', 'u'), ('e', 'm', 'u', 'a'), ('m', 'u', 'a', ' '), ('u', 'a', ' ', 'm'), ('a', ' ', 'm', 'a'), (' ', 'm', 'a', 'n'), ('m', 'a', 'n', 'u), ('a', 'n', 'u', 's'), ... ]For each of the malaysian, indonesian and tamil labels, you then build a language model with add one smoothing, similar to the ones shown in slide 20 of lecture 1, which smooths out all observed ngrams. The differences are that you are required to use probabilities instead of counts and 4-grams instead of unigrams. Your language models for the three labels should look in a way similar to the following table, where rows 3 to 4 are the language models for malaysian, indonesian, and tamil, respectively. Note that each row should sum up to 1, and there are other entries in the table that have been omitted for clarity.
Labels | 4-grams | ||||
---|---|---|---|---|---|
... | ('e','m','u','a') | ('m','u','a',' ') | ('u','a',' ','m') | ... |
|
malaysian |
3.11e-07 |
1.03e-07 |
2.07e-07 |
||
indonesian |
5.17e-07 |
2.07e-07 |
1.52e-07 |
||
tamil |
2.89e-06 |
5.17e-07 |
9.65e-07 |
To test a new string, you should multiply the probabilities of the 4-grams for this string, and return the label (i.e., malaysian, indonesian, and tamil) that gives the highest product. Ignore the four-gram if it is not found in the LMs.
input.train.txt:
the input file for you to build
your LMs, which contains about 900 lines, where each line is a
label/string pair separated by a space. The label will either read
malaysian, indonesian, and tamil.input.test.txt:
the input file for you to test your
LMs, which contains 20 strings, each in a line. A small number of strings have been inserted by
extraterrestrial aliens, and do not belong to any of the three
languages your program is asked to detect. This is just to
keep it interesting; you can tell us what language you think they
are, and see whether you can tune your program to tell us that they
are other languages.input.correct.txt:
contains the correct labels for
the input in input.test.txt
, in the same format as
input.train.txt
(i.e., each label/string pair is
separated by a space).build_test_LM.py
(e.g.,
input.predict.txt
): contains your predictions in the
same format as input.train.txt
.You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question. You may receive a small amount of extra credit if you can support your answers with experimental results.
For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:
README.txt
(e.g.,
in UPPERCASE): this is a text only file that describes any
information you want me to know about your submission. You should not include any identifiable
information about your assignment (your name, phone number, etc.)
except your matric number and email (we need the email to contact
you about your grade, please use your [u|a|g]*******@nus.edu.sg address,
not your email alias). This is to help you get an objective grade in
your assignment, as we won't associate matric numbers with student
names.build_test_LM.py
for this homework assignment): We will be reading your code, so please do us
a favor and format it nicely.ESSAY.txt
that contains your
answers to the essay questions (no word processing files. If you use
a word processor to write, make sure it exports plain ASCII text
well). Again, do not disclose any identifiable information in your
essay answers.These files will need to be suitably zipped in a single file called
<matric number>.zip
. Please use a zip archive and
not tar.gz, bzip, rar or cab files. Make sure when the archive unzips
that all of the necessary files are found in a directory called
<matric number>
. Upload the resulting zip file to
the IVLE workbin by the due date: 7 Feb 2014,
11:59:59 pm SGT. There absolutely will be no extensions to the
deadline of this assignment. Read the late policy if you're not sure
about grade penalties for
lateness.
The grading criteria for the assignment is tentatively:
input.test.txt
(may have some alien strings too).A language is a dialect with an army and navy Max Weinreich -- see (Wikipedia)So don't expect your program to distinguish these two languages well.