Note that the syllabus and readings are still in flux for the time
being. Readings marked with a "*" will be present in the course pack.
Readings in small print are primary materials
(i.e., original conference and journal papers); read these after the
secondary materials (i.e., textbook chapters) if possible.
Aside from our lecture slides, the textbook slides are linked to
for your reference -- the links below are to a local copy of the
slides given by Hinrich Schütze (by permission), you may want to
refer to the original set at http://www-csli.stanford.edu/~hinrich/newslides.html.
I do not vouch for the correctness or the material presented in any of
the linked slide sets.
The hyperlinks here all work as of Thu Nov 6 10:43:33 SGT 2008
when I last updated this page. Use a search engine with the appropriate
text if the links below stop working.
Date
|
Description
|
Deadlines
|
Week 0:
| Prerequisites
(Please read before coming to class and be familiar with the material)
Readings:
- *P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 1
"Mathematical Foundations" of Modeling the
Internet and the Web. Wiley.
(Covers basic math foundation needed for the course. The
topics introduced here are basically a nutshell of most of the
material we will cover in more depth in class. Warning! this
is a dense chapter, expect to have to read it a couple times.
Contents:
Probability from a Bayesian Perspective, Parameter Estimation from
Data, Mixture models and the Expectation Maximization Algorithm,
Graphical Models, Classification, Clustering, Power Laws)
|
|
Week 1: (13 Aug)
| Introduction to Web-Based Searches
Class Slides: [ .pdf ]
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Chapters 2 and 3
"Basic WWW Technologies" and "Web Graphs" of Modeling the
Internet and the Web. Wiley. (You should
already be familiar with Chapter 2's material from the
Hypermedia or equivalent pre-requisite, so you should spend
more time reading Chapter 3's material).
[ Chapter 2 slides (.ppt) ]
[ Chapter 3 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (2008) Chapter 19
"Web Search Basics" of Introduction to Information
Retrieval. Cambridge UP. (Caution: may be
too advanced for this stage in our course. Skim and re-read
closer to the end of the course.)
[ .pdf of Chapter 19 ]
[ Chapter 19 Slides (.pdf) ]
- *S. Lawrence and C.L. Giles (1999). Accessibility of information
on the web. Nature, Vol. 400(8), pp. 107-109. (Short note
describing how articles easily available on the internet
(self-archived) create larger impact)
[ Link ]
- *A-L. Babarasi and R. Albert
(1999). Emergence of scaling in random networks. Science, Volume 286.
Pre-print. (A good article covering lots of area on evolving networks. Try to understand up to section VIII.)
[ ArXiV link ]
|
|
Week 2: (20 Aug)
| Intro to IR and the Vector-Space Model
Class Slides: [ .pdf ]
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.3
"Content-Based Ranking" of Modeling the
Internet and the Web. Wiley. (There is a link to the
.pdf for the whole of Chapter 4
provided by the authors as their sample chapter. We will be
using this chapter as the basic overview for the next week.)
[ .pdf of Chapter 4 from UC Irvine ]
[ Chapter 4 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (2008) Chapters 6-9
"Scoring, term weighting & the vector space model", "Computing scores in a complete search system", "Evaluation in information retrieval" and "Relevance feedback and query expansion" of Introduction to Information
Retrieval. Cambridge UP. (Covers the same as the
Baldi book but in slightly more depth).
[ .pdf of Chapter 6 ]
[ .pdf of Chapter 7 ]
[ .pdf of Chapter 8 ]
[ .pdf of Chapter 9 ]
[ Chapter 6 Slides (.pdf) ]
[ Chapter 7 Slides (.pdf) ]
[ Chapter 8 Slides (.pdf) ]
[ Chapter 9 Slides (.pdf) ]
- *G. Salton (1972). Dynamic document processing. Communications
of the ACM, Vol. 15(7), pp. 658-668.
(One of the original papers from the inventors of IR. Check it out!)
[ ACM Portal Link ]
- *K. Järvelin and J. Kekäläinen (2002). Cumulated gain-based evaluation of IR techniques, Transactions on Information Systems. (The paper that described the newer evaluation metric nDCG).
[ .pdf from uta.fi ]
|
|
Week 3: (27 Aug)
| Probabilistic IR Model and Language Modeling
Class Slides: [ .pdf ]
Readings:
- C. Manning, P. Raghavan and H. Schütze (2008) Chapters 11-12
"Probabilistic Information Retrieval" and "Language Models for Information Retrieval" of Introduction to Information Retrieval. Cambridge UP. (These two chapters should be considered the primary source for this week; skip 11.3.4, 11.4.2-11.5, 12.4)
[ .pdf of Chapter 11 ]
[ .pdf of Chapter 12 ]
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.4
"Probablistic Retrieval" of Modeling the
Internet and the Web. Wiley.
- *K. Sparck Jones, S. Walker and S.E. Robertson (1998). A probabilistic model of information retrieval: development and
status. Technical Report 446, Cambridge University Computer
Laboratory. (This is a very complete description of probabilistic IR from the people who pioneered it; you can just read Sections 2 & 4; if you want to know more about relevance feedback, read Sections 5 and 6)
[ SIG.IR NUS Link ]
- *J.M. Ponte and W.B. Croft (1998) A language
modeling approach to information retrieval. ACM SIGIR 1998, pp
275-281. (Discusses the language modeling approach to IR -- still much more to be done here with increasingly large data sets)
[ CiteSeerX Link ]
| Homework #1 Announced
|
Week 4: (3 Sep)
| Improving Search I - Dimensionality Reduction and Tutorial 0 / Make Up Lecture 1
Class Slides: [ .pdf ]
Supplement on Linear Algebra: [ .pdf ]
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.5 "Latent Semantic Analysis" Modeling the
Internet and the Web. Wiley.
- C. Manning, P. Raghavan and H. Schütze (2008) Chapter 18
"Dimensionality Reduction and Latent Semantic Indexing" of
Introduction to Information Retrieval. Cambridge UP. (Covers the same material as
the Baldi et al. book, but in more depth)
[ .pdf of Chapter 18 ]
- *S. Deerwester, S. Dumais, G. Furnas,
T. Landauer and R. Harshman
(1990). Indexing by latent semantic analysis. Journal of the American
Society of Information Science, Vol. 41(6), pp. 391-407. (An
expanded version of the original paper that pioneered dimensionality
reduction)
[ CiteSeer@NUS Link ]
- *T. Hofmann (1999)
Probabilistic latent semantic indexing. ACM SIGIR 99. (The
breakthrough paper that is the basis for newer Bayesian analysis to
dimensionality reduction)
[ ACM Portal Link ]
- *D. Blei, A.Y. Ng, and M.I. Jordan (2003) Latent Dirichlet Allocation. In J. of A.I. Research. (a three level Bayesian model that was applied to pLSI. This paper is now used as a basis for much further NLP and IR research under the Bayesian framework.)
[ Link from Stanford ]
|
|
Week 5: (10 Sep)
| Improving Search II - Use of Links and Structures
Class Slides: [ .pdf ]
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 5 "Link Analysis" in Modeling the
Internet and the Web. Wiley.
[ Chapter 5 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (2008) Chapter 21
"Link Analysis" of Introduction to Information Retrieval. Cambridge UP.
[ .pdf of Chapter 21 ]
[ Chapter 21 Slides (.pdf ]
- *S. Brin and L. Page (1998). The anatomy of a large-scale
hypertextual web search engine. Proceedings of the 7th International
World Wide Web Conference (WWW7), Brisbane, Australia,
pp. 107-117. (This is the original paper on the PageRank
algorithm)
[ CiteSeer@NUS link ]
- T.H. Haveliwala (2002). Topic-Sensitive PageRank. Proceedings
of the 11th International World Wide Web Conference (WWW2002),
Honolulu, Hawaii, USA. (Making PageRank biased to some "basis" topics by playing with the teleportation factor. Also used for detecting spam using "trusted" sites as bases.)
[ CiteSeerX Link ]
- J. Kleinberg (1998). Authoritative sources in
a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete
Algorithms. (Describes HITS; when we decouple both ends of the directed edge in calculating prestige)
[ CiteSeerX Link ]
- R. Lempel and S. Moran (2000).
The stochastic approach for link-structure analysis (SALSA) and
the TKC effect. Proceedings of WWW 9 (1999). (Bringing
Kleinberg's HITS to a bipartite framework; and explaining its benefit
to Tightly Knit Communities)
[ CiteSeerX Link ]
|
|
Week 6: (17 Sep)
| Improving Search III - Relations and Passage Retrieval and Tutorial - Retrieval 1 (directly after class)
Class Slides: [ .pdf ]
Readings:
- H. Cui, R. Sun, K. Li, M.Y. Kan, T.S. Chua (2005). Question
Answering Passage Retrieval Using Dependency Relations. ACM SIGIR,
400-407. (better ranking based on grammatical dependencies between words in a passage)
[ Link from Min's page ]
- H. Cui, J.R. Wen, J.Y. Nie and W.Y. Ma (2002). Probabilistic
query expansion using query logs. Proceedings of the 11th
International World Wide Web Conference (WWW2002), Honolulu, Hawaii,
USA. (query expansion from another external resource: query
logs)
[ CiteSeerX Link ]
- *R.M. Tong, L.A. Appelbaum, V.N. Askman and J.F. Cunningham
(1987). Conceptual information retrieval using RUBRIC. Proceedings of
the 10th ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR'87), New Orleans, Louisiana, USA,
pp. 247-253. (An early paper that discusses how thesaural
knowledge can be integrated to IR; from the pre-WordNet era)
[ ACM Portal Link ]
- *H. Yang, T.S. Chua, S. Wang and C.K. Koh. (2003) Structured use
of external knowledge for event-based open-domain
question-answering. 26th Int'l ACM SIGIR Conference. (Putting
together resources in a unified manner)
[ Link to CMU's copy ]
- *S. Tellex, B. Katz, J. Lin, A. Fernandes and
G.Marton (2003) Quantitative Evaluation of Passage Retrieval
Algorithms for Question Answering. (Survey paper on passage
retrieval methods; used in class for example ranking exercise)
[ MIT CSAIL Link ]
|
|
Recess Week (Sat 20 Sep - Sun 28 Sep)
|
Holiday - Hari Raya Puasa (1 Oct)
|
Week 7: (Note special date and location; 2 Oct 6:30-8:30pm SR6, make up lecture)
| Question Answering
Class Slides: [ .pdf ]
Readings:
- *L. Hirschman, M. Light, E. Breck And J.D. Burger (1999). Deep
Read: a Reading Comprehension System. Proceedings of the 37th Meeting
of the Association for Computational Linguistics (ACL'99), College
Park, Maryland, USA, pp. 325-332.
[ ACL Anthology Link ]
- *H. Yang and T.S. Chua (2003). QUALIFIER: question answering
by lexical fabric and external resources. Proceedings of the 10th
Conference of the European Chapter of the Association for
Computational Linguistics (EACL 03) (Density based methods for
passage retrieval leading up to questions answering)
[ CiteSeerX Link ]
- *D. Moldovan and A. Novischi (2002). Lexical Chains for Question
Answering. Proceedings of the 19th International Conference on
Computational Linguistics (COLING 2002), Taipei, Taiwan.
[ ACL Anthology Link ]
- *E. Voorhees (2002). Overview of the TREC 2002 Question
Answering Track, In Notebook of the Eleventh Text Retrieval Conference
(TREC 2002), 115-123.
[ CiteseerX Link ]
-
Hang Cui, Min-Yen Kan and Tat-Seng Chua (2004) Unsupervised
Learning of Soft Patterns for Generating Definitions from Online
News. In Proceedings of the 13th International World Wide Web
Conference (WWW2004), May 2004. New York, New York, USA.
[ Link from Min's Page ]
| Homework #1 Due Homework #2 Announced
|
Week 8: (8 Oct)
| Summarization and Tutorial - Retrieval 2
Class Slides: [ .pdf ]
Readings:
- *J. Kupiec, J. Pedersen and F. Chen (1995). A trainable document
summarizer. Proceedings of the 18th ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR'95), Seattle,
Washington, USA, pp. 68-73. (The work that took out the heuristic approaches to summarization and made it a learning problem)
[ ACM Portal Link ]
- *T. Nomoto and Y. Matsumoto (2001). A new approach to
unsupervised text summarization. Proceedings of the 24th ACM SIGIR
Conference on Research and Development in Information Retrieval
(SIGIR'01), New Orleans, Louisiana, USA, pp. 26-34. (Great paper showing a use of X-means clustering for summarization)
[ ACM Portal Link ]
- *G. Erkan and D. Radev (2004) LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. of AI Research. Vol. 22. (Viewing documents as a graph and using PageRank to compute n-best sentences)
[ Link to UMich copy ]
- *H. Jing and K. McKeown (2004) The decomposition of human-written summary sentences. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 129-136. (Describes how an HMM can be used to align abstracts to source articles)
[ CiteSeerX Link ]
- *K. Knight and D. Marcu (2000) Statistics-Based Summarization Step One: Sentence Compression. Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 703-710. (Combines NLP and the noisy channel model to create a sentence compression scheme)
[ CiteSeerX Link ]
|
|
Week 9: (15 Oct)
| Text Categorization and Tutorial - Summarization (directly after class)
Class Slides: [ .pdf ]
Readings:
- C. Manning, P. Raghavan and H. Schütze (2008) Chapters 13-14
"Text classification and Naïve Bayes" and "Vector space
classification" of Introduction to Information
Retrieval. Cambridge UP.
[ .pdf of Chapter 13 ]
[ .pdf of Chapter 14 ]
[ Chapter 13 Slides (.pdf) ]
[ Chapter 14 Slides (.pdf) ]
- *Y. Yang and J.O. Pedersen (1997). A comparative study on
feature selection in text categorization. Proceedings of the 14th
International Conference on Machine Learning (ICML'97), Nashville,
Tennessee, USA, pp. 412-420.
[ CiteSeerX Link ]
- *Y. Yang and X. Liu (1999). A re-examination of text
categorization methods. Proceedings of the 22nd ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'99),
Berkeley, California, USA, pp. 42-49.
[ CiteSeerX Link ]
- *L. Man, C.-L. Tan, H.-B. Low, S.-Y. Sung (2005) A comprehensive comparative study on term weighting schemes for text categorization with support vector machines, Poster Paper in WWW '05. (Discusses text classification sensitive weighting schemes)
[ WWW 2005 Link ]
- *J.M. Liu and T.S. Chua (2001). Building semantic perceptron net
for topic spotting. Proceedings of the 39th Meeting of Association of
Computational Linguistics (ACL'01), Toulouse, France, pp. 370-377. (Hierarchical version of text classification using perceptrons)
[ ACL Anthology Link ]
|
|
Week 10: (22 Oct)
| Text Clustering and Tutorial - Categorization (directly after class)
Class Slides: [ .pdf ]
Readings:
| Homework #1 Returned
|
Week 11: (29 Oct)
| Sequence Labeling I and Named Entity Recognition
Class Slides: [ .pdf ]
Experimentation:
- Predicting
Baltimore Weather from Ice Cream Consumption (.xls) - the
Forward Backward estimation algorithm. Used in class, but which you
should also experiment with on your own. The Viterbi decoding version.
- *J. Eisner (2002). An interactive spreadsheet for teaching the forward-backward algorithm. In Proc. of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 10-18, Philadelphia, July. (Covers how to use this spreadsheet for teaching HMM's forward and backward algorithm. Go through the exercises mentioned to make sure you understand them.
Readings:
- *G. Zhou and J. Su (2002). Named Entity Recognition using an HMM-based Chunk Tagger. Proc. of 40th ACL (ACL '02). pp. 473-480.
[ ACL Anthology Link ]
- *S Cucerzan, D. Yarowsky (1999). Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. Proc. of EMNLP 1999.
[ ACL Anthology Link ]
- Also see the excellent post by Chris Manning and subsequent commentary: Doing Named Entity Recognition? Don't optimize for F1.
|
|
Week 12: (5 Nov)
| Sequence Labeling II and Information Extraction
Class Slides: [ .pdf ]
Readings:
- J. Lafferty, A. McCallum, F. Pereira (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. International Conf. on Machine Learning.
[ Link from CiteSeer@NUS ]
- *A. McCallum, D. Freitag, F. Pereira (2000) Maximum entropy markov models for information extraction and segmentation. International Conf. on Machine Learning.
[ Link from CiteSeerX ]
- *C. Cardie (1997). Empirical methods in information
extraction. AI Magazine, 18(4): 65-79. Special Issue on Natural
Language Processing.
[ CiteSeer@NUS Link ]
- *S. Soderland (1999). Learning information extraction rules for
semi-structured and free text. Machine Learning, Vol. 34(1-3),
pp. 233-272.
[ CiteSeer@NUS Link ]
- *J. Xiao, T. S. Chua and J. M. Liu, A Global Rule Induction
Approach to Information Extraction, ICTAI2003.
[ IEEE Xplore Link ]
| Homework #2 Due
|
Week 13: (12 Nov)
| Learning To Rank and Revision
Class Slides: [ .pdf ]
Readings:
- C. Manning, P. Raghavan and H. Schütze (2008) Section 15.4
"Machine-learning methods in ad hoc information retrieval" of
Introduction to Information Retrieval. Cambridge
UP. (Covers pointwise and pairwise versions
of the Learning to Rank (LeToR) paradigm, refer to the other
papers below for specifics; warning: you'll need to have a good
grasp of statistics and machine learning to read the
supplemental papers, consider reading all of Chapter 15 in the
text first, and possibly a good machine learning book before
attempting the below papers in detail.)
[ .pdf of Chapter 15 ]
- *T.-Y. Liu (2008) Learning To Rank Tutorial Slideset. Presented at WWW 08 and SIGIR 2008. (The much more comprehensive slide set for the LeToR tutorials which compose most of the source of today's slides.)
[ Link to Tie-Yan's page at MSRA (click on the .ppt logos at the bottom of the page) ]
- *N. Fuhr (1989) Optimum polynomial retrieval functions based on the probability ranking principle, In ACM Trans. on Information Systems
[ ACM Portal Link ]
- * W. W. Cohen, R. E. Shapire, Y. Singer (1999) Learning to order things, Journal of Artificial intelligence research (JAIR)
[ Link from CSAIL, MIT ]
- *T. Joachims (2002) Optimizing Search Engines Using Clickthrough Data, In Proc. of KDD
[ Link from Joachim's Website ]
- *R. Nallapati (2004) Discriminative models for information retrieval, In Proc of SIGIR
[ Link to ACM Portal ]
- *Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai and H. Li (2007) Learning to Rank: From Pairwise to Listwise Approach, In Proc. of ICML.
[ Link from Microsoft Research ]
|
|
Reading Week (Sat 15 Nov - Fri 21 Nov)
|
Final Exam (26 Nov, 7:30 SR3 (COM1 02-12))
|