Note that the syllabus and readings are still in flux for the time
being. Readings marked with a "*" will be present in the course pack.
Readings in small print are primary materials
(i.e., original conference and journal papers); read these after the
secondary materials (i.e., textbook chapters) if possible.
Links from this page to the textbook slide sets are for
reference only. I do not vouch for the correctness of the material
presented in any of the linked slide sets. I will likely be using a
composite of slides compiled from these and various other sources, but
these may not be made available on the internet due to constraints on
preparation time.
The hyperlinks here all work as of Tue Dec 19 23:16:21 GMT-8 2006,
when I updated this page. Use a search engine with the appropriate
text if the links below stop working.
Date
|
Description
|
Deadlines
|
Week 0:
| Prerequisites
(Please read before coming to class and be familiar with the material)
Readings:
- *P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 1
"Mathematical Foundations" of Modeling the
Internet and the Web. Wiley.
(Covers the basic math foundation needed for the course. The
topics introduced here are a condensed overview of most of the
material we will cover in more depth in class. Warning: this
is a dense chapter; expect to have to read it a couple of times.
Contents:
Probability from a Bayesian Perspective, Parameter Estimation from
Data, Mixture models and the Expectation Maximization Algorithm,
Graphical Models, Classification, Clustering, Power Laws)
|
|
Week 1: (8 Jan)
| Introduction to Web-Based Searches
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Chapters 2 and 3
"Basic WWW Technologies" and "Web Graphs" of Modeling the
Internet and the Web. Wiley. (You should
already be familiar with Chapter 2's material from the
Hypermedia or equivalent pre-requisite, so you should spend
more time reading Chapter 3's material).
[ Chapter 2 slides (.ppt) ]
[ Chapter 3 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 19
"Web Search Basics" of Introduction to Information
Retrieval. Cambridge UP. (Caution: may be
too advanced for this stage in our course. Skim and re-read
closer to the end of the course.)
[ .pdf of Chapter 19 ]
[ Chapter 19 slides Part 1 (.pdf) ]
[ Chapter 19 slides Part 2 (.pdf) ]
- *S. Lawrence and C.L. Giles (1999). Accessibility of information
on the web. Nature, Vol. 400(8), pp. 107-109. (Short note
describing how articles easily available on the internet
(self-archived) create larger impact)
[ Link ]
- *A.-L. Barabási and R. Albert
(1999). Emergence of scaling in random networks. Science, Vol. 286.
Pre-print
[ arXiv link ]
|
|
Week 2: (15 Jan)
| Intro to IR and Vector-Space Model
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.3
"Content-Based Ranking" of Modeling the
Internet and the Web. Wiley. (There is a link to the
.pdf for the whole of Chapter 4
provided by the authors as their sample chapter. We will be
using this chapter as the basic overview for the next couple of
weeks.)
[ .pdf of Chapter 4 from UC Irvine ]
[ Chapter 4 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 7
"Vector Space Retrieval" of Introduction to Information
Retrieval. Cambridge UP. (Covers the same as the
Baldi book but in more depth.)
[ .pdf of Chapter 6 ]
[ .pdf of Chapter 7 ]
[ Chapter 6 slides (.pdf) ]
[ Chapter 7 slides (.pdf) ]
- *G. Salton (1972). Dynamic document processing. Communications
of the ACM, Vol. 15(7), pp. 658-668.
[ ACM Portal Link ]
|
|
Week 3: (22 Jan)
| Probabilistic IR Model and Language Modeling and Tutorial 0 - Math Foundations
Tutorial 0 will be offered both before (5:00-6:00pm) and after
(8:30-9:30pm) class. It will cover Baldi et al., Sections 1.1-1.3.
The other sections will be taught later or covered in other SoC
modules.
Readings:
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapters 11-12
"Probabilistic Information Retrieval" and "Language Models for Information Retrieval" of Introduction to Information Retrieval. Cambridge UP. (These two chapters should be considered the primary source for this week; skip 11.3.4, 11.4.2-11.5, 12.4)
[ .pdf of Chapter 11 ]
[ Chapter 11 slides (.ppt) ]
[ .pdf of Chapter 12 ]
[ Chapter 12 slides (.ppt) ]
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.4
"Probabilistic Retrieval" of Modeling the
Internet and the Web. Wiley.
- *K. Sparck Jones, S. Walker and S.E. Robertson (1998). A
probabilistic model of information retrieval: development and
status. Technical Report 446, Cambridge University Computer
Laboratory. (This is a very complete description of probabilistic IR from the people who pioneered it; you can just read Sections 2 & 4; if you want to know more about relevance feedback, read Sections 5 and 6)
[ CiteSeer@NUS Link ]
- J.M. Ponte and W.B. Croft (1998). A language
modeling approach to information retrieval. ACM SIGIR 1998,
pp. 275-281. (Discusses the language modeling approach to IR; there is still much more to be done here with increasingly large data sets)
[ CiteSeer@NUS Link ]
| Assignment #1 out
|
Week 4: (29 Jan)
| Improving Search I - LSA and Adaptive Search and Tutorial - Retrieval 1
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.5 "Latent Semantic Analysis" of Modeling the
Internet and the Web. Wiley.
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 18
"Dimensionality Reduction and Latent Semantic Indexing" of
Introduction to Information Retrieval. Cambridge UP. (Covers the same material as
the Baldi et al. book, but in more depth)
[ .pdf of Chapter 18 ]
[ Chapter 18 slides (.pdf) ]
- *S. Deerwester, S. Dumais, G. Furnas,
T. Landauer and R. Harshman
(1990). Indexing by latent semantic analysis. Journal of the American
Society for Information Science, Vol. 41(6), pp. 391-407. (An
expanded version of the original paper that pioneered dimensionality
reduction)
[ CiteSeer@NUS Link ]
- T. Hofmann (1999).
Probabilistic latent semantic indexing. ACM SIGIR 99. (The
breakthrough paper that is the basis for newer Bayesian approaches to
dimensionality reduction)
[ ACM Portal Link ]
|
|
Week 5: (5 Feb)
| Improving Search II - Use of Links and Structures
Readings:
- P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 5 "Link Analysis" in Modeling the
Internet and the Web. Wiley.
[ Chapter 5 slides (.ppt) ]
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 21
"Link Analysis" of Introduction to Information Retrieval. Cambridge UP.
[ .pdf of Chapter 21 ]
[ Chapter 21 slides (.pdf) ]
- *S. Brin and L. Page (1998). The anatomy of a large-scale
hypertextual web search engine. Proceedings of the 7th International
World Wide Web Conference (WWW7), Brisbane, Australia,
pp. 107-117. (This is the original paper on the PageRank
algorithm)
[ CiteSeer@NUS link ]
- *T.H. Haveliwala (2002). Topic-Sensitive PageRank. Proceedings
of the 11th International World Wide Web Conference (WWW2002),
Honolulu, Hawaii, USA. (Making PageRank biased to some "basis" topics by playing with the teleportation factor)
[ CiteSeer@NUS Link ]
- J. Kleinberg (1998). Authoritative sources in
a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete
Algorithms. (Describes HITS, which decouples the two ends of each directed edge when calculating prestige)
[ CiteSeer@NUS Link ]
- R. Lempel and S. Moran (2000).
The stochastic approach for link-structure analysis (SALSA) and
the TKC effect. Proceedings of WWW9. (Bringing
Kleinberg's HITS to a bipartite framework and explaining its benefit
to Tightly Knit Communities)
[ CiteSeer@NUS Link ]
|
|
Week 6: (12 Feb)
| Improving Search III - Relations and Passage Retrieval and Tutorial - Retrieval 2
Readings:
- *R.M. Tong, L.A. Appelbaum, V.N. Askman and J.F. Cunningham
(1987). Conceptual information retrieval using RUBRIC. Proceedings of
the 10th ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR'87), New Orleans, Louisiana, USA,
pp. 247-253. (An early paper that discusses how thesaural
knowledge can be integrated into IR; from the pre-WordNet era)
[ ACM Portal Link ]
- *H. Yang, T.S. Chua, S. Wang and C.K. Koh. (2003) Structured use
of external knowledge for event-based open-domain
question-answering. 26th Int'l ACM SIGIR Conference. (Putting
together resources in a unified manner)
[ Link to CMU's copy ]
- *H. Cui, J.R. Wen, J.Y. Nie and W.Y. Ma (2002). Probabilistic
query expansion using query logs. Proceedings of the 11th
International World Wide Web Conference (WWW2002), Honolulu, Hawaii,
USA. (Query expansion from another external resource: query
logs)
[ CiteSeer@NUS Link ]
- *H. Yang and T.S. Chua (2003). QUALIFIER: question answering
by lexical fabric and external resources. Proceedings of the 10th
Conference of the European Chapter of the Association for
Computational Linguistics (EACL 03). (Density-based methods for
passage retrieval leading up to question answering)
[ CiteSeer@NUS Link ]
- *H. Cui, R. Sun, K. Li, M.Y. Kan, T.S. Chua (2005). Question
Answering Passage Retrieval Using Dependency Relations. ACM SIGIR,
pp. 400-407. (Better ranking based on grammatical dependencies between words in a passage)
[ Link to SoC's copy ]
|
|
Mid-semester Break (Mon 19 Feb - Fri 23 Feb 2007)
|
Week 7: (26 Feb)
| Question Answering
Readings:
- *L. Hirschman, M. Light, E. Breck and J.D. Burger (1999). Deep
read: a reading comprehension system. Proceedings of the 37th Meeting
of the Association for Computational Linguistics (ACL'99), College
Park, Maryland, USA, pp. 325-332.
[ CiteSeer@NUS Link ]
- *D. Moldovan and A. Novischi (2002). Lexical chains for question
answering. Proceedings of the 19th International Conference on
Computational Linguistics (COLING 2002), Taipei, Taiwan.
[ CiteSeer@NUS Link ]
- *E. Voorhees (2002). Overview of the TREC 2002 Question
Answering Track, In notebook of the Eleventh Text REtrieval Conference
(TREC 2002), 115-123.
[ CiteSeer@NUS Link ]
- H. Cui, M.Y. Kan and T.S. Chua (2004). Unsupervised
Learning of Soft Patterns for Generating Definitions from Online
News. Proceedings of the 13th International World Wide Web
Conference (WWW2004), May 2004, New York, New York, USA.
[ From Min's Home Page ]
| Assignment #1 due
|
Week 8: (5 Mar)
| Summarization I
Readings:
- *J. Kupiec, J. Pedersen and F. Chen (1995). A trainable document
summarizer. Proceedings of the 18th ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR'95), Seattle,
Washington, USA, pp. 68-73. (The work that replaced heuristic approaches to summarization with a learning formulation of the problem)
[ CiteSeer@NUS Link ]
- *T. Nomoto and Y. Matsumoto (2001). A new approach to
unsupervised text summarization. Proceedings of the 24th ACM SIGIR
Conference on Research and Development in Information Retrieval
(SIGIR'01), New Orleans, Louisiana, USA, pp. 26-34. (Great paper showing a use of X-means clustering for summarization)
[ ACM Portal Link ]
| Assignment #2 out
|
Week 9: (12 Mar)
| Summarization II and Tutorial - Summarization
Readings:
- *G. Erkan and D. Radev (2004) LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. of AI Research. Vol. 22. (Viewing documents as a graph and using PageRank to compute n-best sentences)
[ Link to UMich copy ]
- H. Jing and K. McKeown (1999) The decomposition of human-written summary sentences. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129-136. (Describes how an HMM can be used to align abstracts to source articles)
[ CiteSeer@NUS Link ]
- K. Knight and D. Marcu (2000) Statistics-Based Summarization Step One: Sentence Compression. Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 703-710. (Combines NLP and the noisy channel model to create a sentence compression scheme)
[ CiteSeer@NUS Link ]
|
|
Week 10: (19 Mar)
| Text Categorization
Readings:
- C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapters 13-14
"Text classification and Naïve Bayes" and "Vector space
classification" of Introduction to Information
Retrieval. Cambridge UP.
[ .pdf of Chapter 13 ] [ .pdf of Chapter 14 ]
[ Chapter 13 slides (.pdf) ] [ Chapter 14 slides (.pdf) ]
- *Y. Yang and J.O. Pedersen (1997). A comparative study on
feature selection in text categorization. Proceedings of the 14th
International Conference on Machine Learning (ICML'97), Nashville,
Tennessee, USA, pp. 412-420.
[ CiteSeer@NUS Link ]
- *Y. Yang and X. Liu (1999). A re-examination of text
categorization methods. Proceedings of the 22nd ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'99),
Berkeley, California, USA, pp. 42-49.
[ CiteSeer@NUS Link ]
| Assignment #1 returned
|
Week 11: (26 Mar)
| Text Clustering and Tutorial - Categorization
Readings:
|
|
Week 12: (2 Apr)
| Named Entity Recognition
Readings:
- *G. Zhou and J. Su (2002). Named Entity Recognition using an HMM-based Chunk Tagger. Proceedings of the 40th ACL (ACL '02), pp. 473-480.
[ CiteSeer@NUS Link ]
- *S. Baluja, V. Mittal and R. Sukthankar (1999). Applying machine
learning for high performance named-entity extraction. Pacific
Association for Computational Linguistics (PACLING'99), Waterloo,
Canada.
[ CiteSeer@NUS Link ]
|
|
Week 13: (9 Apr)
| Information Extraction
Readings:
- *C. Cardie (1997). Empirical methods in information
extraction. AI Magazine, 18(4): 65-79. Special Issue on Natural
Language Processing.
[ CiteSeer@NUS Link ]
- *S. Soderland (1999). Learning information extraction rules for
semi-structured and free text. Machine Learning, Vol. 34(1-3),
pp. 233-272.
[ CiteSeer@NUS Link ]
- *J. Xiao, T.S. Chua and J.M. Liu (2003). A global rule induction
approach to information extraction. ICTAI 2003.
[ IEEE Xplore Link ]
| Assignment #2 due
|
Reading Week (Sat 14 Apr - Fri 20 Apr 2007)
|
Final Exam (Mon 30 Apr, evening)
|