CS 5246 - Text Processing and the Web

Note that the syllabus and readings are still in flux for the time being. Readings marked with a "*" will be present in the course pack. Readings in small print are primary materials (i.e., original conference and journal papers); read these after the secondary materials (i.e., textbook chapters) if possible.

Aside from our lecture slides, the textbook slides are linked to for your reference -- the links below are to a local copy of the slides given by Hinrich Schütze (by permission), you may want to refer to the original set at http://www-csli.stanford.edu/~hinrich/newslides.html. I do not vouch for the correctness or the material presented in any of the linked slide sets.

The hyperlinks here all work as of Thu Nov 6 10:43:33 SGT 2008 when I last updated this page. Use a search engine with the appropriate text if the links below stop working.

Date	Description	Deadlines
Week 0:	Prerequisites (Please read before coming to class and be familiar with the material) Readings: *P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 1 "Mathematical Foundations" of Modeling the Internet and the Web. Wiley. (Covers basic math foundation needed for the course. The topics introduced here are basically a nutshell of most of the material we will cover in more depth in class. Warning! this is a dense chapter, expect to have to read it a couple times. Contents: Probability from a Bayesian Perspective, Parameter Estimation from Data, Mixture models and the Expectation Maximization Algorithm, Graphical Models, Classification, Clustering, Power Laws)
Week 1: (13 Aug)	Introduction to Web-Based Searches Class Slides: [ .pdf ] Readings: P. Baldi, P. Frasconi and P. Smyth (2003) Chapters 2 and 3 "Basic WWW Technologies" and "Web Graphs" of Modeling the Internet and the Web. Wiley. (You should already be familiar with Chapter 2's material from the Hypermedia or equivalent pre-requisite, so you should spend more time reading Chapter 3's material). [ Chapter 2 slides (.ppt) ] [ Chapter 3 slides (.ppt) ] C. Manning, P. Raghavan and H. Schütze (2008) Chapter 19 "Web Search Basics" of Introduction to Information Retrieval. Cambridge UP. (Caution: may be too advanced for this stage in our course. Skim and re-read closer to the end of the course.) [ .pdf of Chapter 19 ] [ Chapter 19 Slides (.pdf) ] S. Lawrence and C.L. Giles (1999). Accessibility of information on the web. Nature, Vol. 400(8), pp. 107-109. (Short note describing how articles easily available on the internet (self-archived) create larger impact) [ Link ] A-L. Babarasi and R. Albert (1999). Emergence of scaling in random networks. Science, Volume 286. Pre-print. (A good article covering lots of area on evolving networks. Try to understand up to section VIII.) [ ArXiV link ]
Week 2: (20 Aug)	Intro to IR and the Vector-Space Model Class Slides: [ .pdf ] Readings: P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.3 "Content-Based Ranking" of Modeling the Internet and the Web. Wiley. (There is a link to the .pdf for the whole of Chapter 4 provided by the authors as their sample chapter. We will be using this chapter as the basic overview for the next week.) [ .pdf of Chapter 4 from UC Irvine ] [ Chapter 4 slides (.ppt) ] C. Manning, P. Raghavan and H. Schütze (2008) Chapters 6-9 "Scoring, term weighting & the vector space model", "Computing scores in a complete search system", "Evaluation in information retrieval" and "Relevance feedback and query expansion" of Introduction to Information Retrieval. Cambridge UP. (Covers the same as the Baldi book but in slightly more depth). [ .pdf of Chapter 6 ] [ .pdf of Chapter 7 ] [ .pdf of Chapter 8 ] [ .pdf of Chapter 9 ] [ Chapter 6 Slides (.pdf) ] [ Chapter 7 Slides (.pdf) ] [ Chapter 8 Slides (.pdf) ] [ Chapter 9 Slides (.pdf) ] G. Salton (1972). Dynamic document processing. Communications of the ACM, Vol. 15(7), pp. 658-668. (One of the original papers from the inventors of IR. Check it out!) [ ACM Portal Link ] K. Järvelin and J. Kekäläinen (2002). Cumulated gain-based evaluation of IR techniques, Transactions on Information Systems. (The paper that described the newer evaluation metric nDCG). [ .pdf from uta.fi ]
Week 3: (27 Aug)	Probabilistic IR Model and Language Modeling Class Slides: [ .pdf ] Readings: C. Manning, P. Raghavan and H. Schütze (2008) Chapters 11-12 "Probabilistic Information Retrieval" and "Language Models for Information Retrieval" of Introduction to Information Retrieval. Cambridge UP. (These two chapters should be considered the primary source for this week; skip 11.3.4, 11.4.2-11.5, 12.4) [ .pdf of Chapter 11 ] [ .pdf of Chapter 12 ] P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.4 "Probablistic Retrieval" of Modeling the Internet and the Web. Wiley. K. Sparck Jones, S. Walker and S.E. Robertson (1998). A probabilistic model of information retrieval: development and status. Technical Report 446, Cambridge University Computer Laboratory. (This is a very complete description of probabilistic IR from the people who pioneered it; you can just read Sections 2 & 4; if you want to know more about relevance feedback, read Sections 5 and 6) [ SIG.IR NUS Link ] J.M. Ponte and W.B. Croft (1998) A language modeling approach to information retrieval. ACM SIGIR 1998, pp 275-281. (Discusses the language modeling approach to IR -- still much more to be done here with increasingly large data sets) [ CiteSeerX Link ]	Homework #1 Announced
Week 4: (3 Sep)	Improving Search I - Dimensionality Reduction and Tutorial 0 / Make Up Lecture 1 Class Slides: [ .pdf ] Supplement on Linear Algebra: [ .pdf ] Readings: P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.5 "Latent Semantic Analysis" Modeling the Internet and the Web. Wiley. C. Manning, P. Raghavan and H. Schütze (2008) Chapter 18 "Dimensionality Reduction and Latent Semantic Indexing" of Introduction to Information Retrieval. Cambridge UP. (Covers the same material as the Baldi et al. book, but in more depth) [ .pdf of Chapter 18 ] S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, Vol. 41(6), pp. 391-407. (An expanded version of the original paper that pioneered dimensionality reduction) [ CiteSeer@NUS Link ] T. Hofmann (1999) Probabilistic latent semantic indexing. ACM SIGIR 99. (The breakthrough paper that is the basis for newer Bayesian analysis to dimensionality reduction) [ ACM Portal Link ] *D. Blei, A.Y. Ng, and M.I. Jordan (2003) Latent Dirichlet Allocation. In J. of A.I. Research. (a three level Bayesian model that was applied to pLSI. This paper is now used as a basis for much further NLP and IR research under the Bayesian framework.) [ Link from Stanford ]
Week 5: (10 Sep)	Improving Search II - Use of Links and Structures Class Slides: [ .pdf ] Readings: P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 5 "Link Analysis" in Modeling the Internet and the Web. Wiley. [ Chapter 5 slides (.ppt) ] C. Manning, P. Raghavan and H. Schütze (2008) Chapter 21 "Link Analysis" of Introduction to Information Retrieval. Cambridge UP. [ .pdf of Chapter 21 ] [ Chapter 21 Slides (.pdf ] *S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International World Wide Web Conference (WWW7), Brisbane, Australia, pp. 107-117. (This is the original paper on the PageRank algorithm) [ CiteSeer@NUS link ] T.H. Haveliwala (2002). Topic-Sensitive PageRank. Proceedings of the 11th International World Wide Web Conference (WWW2002), Honolulu, Hawaii, USA. (Making PageRank biased to some "basis" topics by playing with the teleportation factor. Also used for detecting spam using "trusted" sites as bases.) [ CiteSeerX Link ] J. Kleinberg (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. (Describes HITS; when we decouple both ends of the directed edge in calculating prestige) [ CiteSeerX Link ] R. Lempel and S. Moran (2000). The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Proceedings of WWW 9 (1999). (Bringing Kleinberg's HITS to a bipartite framework; and explaining its benefit to Tightly Knit Communities) [ CiteSeerX Link ]
Week 6: (17 Sep)	Improving Search III - Relations and Passage Retrieval and Tutorial - Retrieval 1 (directly after class) Class Slides: [ .pdf ] Readings: H. Cui, R. Sun, K. Li, M.Y. Kan, T.S. Chua (2005). Question Answering Passage Retrieval Using Dependency Relations. ACM SIGIR, 400-407. (better ranking based on grammatical dependencies between words in a passage) [ Link from Min's page ] H. Cui, J.R. Wen, J.Y. Nie and W.Y. Ma (2002). Probabilistic query expansion using query logs. Proceedings of the 11th International World Wide Web Conference (WWW2002), Honolulu, Hawaii, USA. (query expansion from another external resource: query logs) [ CiteSeerX Link ] R.M. Tong, L.A. Appelbaum, V.N. Askman and J.F. Cunningham (1987). Conceptual information retrieval using RUBRIC. Proceedings of the 10th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'87), New Orleans, Louisiana, USA, pp. 247-253. (An early paper that discusses how thesaural knowledge can be integrated to IR; from the pre-WordNet era) [ ACM Portal Link ] H. Yang, T.S. Chua, S. Wang and C.K. Koh. (2003) Structured use of external knowledge for event-based open-domain question-answering. 26th Int'l ACM SIGIR Conference. (Putting together resources in a unified manner) [ Link to CMU's copy ] *S. Tellex, B. Katz, J. Lin, A. Fernandes and G.Marton (2003) Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. (Survey paper on passage retrieval methods; used in class for example ranking exercise) [ MIT CSAIL Link ]
Recess Week (Sat 20 Sep - Sun 28 Sep)
Holiday - Hari Raya Puasa (1 Oct)
Week 7: (Note special date and location; 2 Oct 6:30-8:30pm SR6, make up lecture)	Question Answering Class Slides: [ .pdf ] Readings: L. Hirschman, M. Light, E. Breck And J.D. Burger (1999). Deep Read: a Reading Comprehension System. Proceedings of the 37th Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland, USA, pp. 325-332. [ ACL Anthology Link ] H. Yang and T.S. Chua (2003). QUALIFIER: question answering by lexical fabric and external resources. Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 03) (Density based methods for passage retrieval leading up to questions answering) [ CiteSeerX Link ] D. Moldovan and A. Novischi (2002). Lexical Chains for Question Answering. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan. [ ACL Anthology Link ] E. Voorhees (2002). Overview of the TREC 2002 Question Answering Track, In Notebook of the Eleventh Text Retrieval Conference (TREC 2002), 115-123. [ CiteseerX Link ] Hang Cui, Min-Yen Kan and Tat-Seng Chua (2004) Unsupervised Learning of Soft Patterns for Generating Definitions from Online News. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May 2004. New York, New York, USA. [ Link from Min's Page ]	Homework #1 Due Homework #2 Announced
Week 8: (8 Oct)	Summarization and Tutorial - Retrieval 2 Class Slides: [ .pdf ] Readings: J. Kupiec, J. Pedersen and F. Chen (1995). A trainable document summarizer. Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), Seattle, Washington, USA, pp. 68-73. (The work that took out the heuristic approaches to summarization and made it a learning problem) [ ACM Portal Link ] T. Nomoto and Y. Matsumoto (2001). A new approach to unsupervised text summarization. Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), New Orleans, Louisiana, USA, pp. 26-34. (Great paper showing a use of X-means clustering for summarization) [ ACM Portal Link ] G. Erkan and D. Radev (2004) LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. of AI Research. Vol. 22. (Viewing documents as a graph and using PageRank to compute n-best sentences) [ Link to UMich copy ] H. Jing and K. McKeown (2004) The decomposition of human-written summary sentences. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 129-136. (Describes how an HMM can be used to align abstracts to source articles) [ CiteSeerX Link ] *K. Knight and D. Marcu (2000) Statistics-Based Summarization Step One: Sentence Compression. Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 703-710. (Combines NLP and the noisy channel model to create a sentence compression scheme) [ CiteSeerX Link ]
Week 9: (15 Oct)	Text Categorization and Tutorial - Summarization (directly after class) Class Slides: [ .pdf ] Readings: C. Manning, P. Raghavan and H. Schütze (2008) Chapters 13-14 "Text classification and Naïve Bayes" and "Vector space classification" of Introduction to Information Retrieval. Cambridge UP. [ .pdf of Chapter 13 ] [ .pdf of Chapter 14 ] [ Chapter 13 Slides (.pdf) ] [ Chapter 14 Slides (.pdf) ] Y. Yang and J.O. Pedersen (1997). A comparative study on feature selection in text categorization. Proceedings of the 14th International Conference on Machine Learning (ICML'97), Nashville, Tennessee, USA, pp. 412-420. [ CiteSeerX Link ] Y. Yang and X. Liu (1999). A re-examination of text categorization methods. Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, California, USA, pp. 42-49. [ CiteSeerX Link ] L. Man, C.-L. Tan, H.-B. Low, S.-Y. Sung (2005) A comprehensive comparative study on term weighting schemes for text categorization with support vector machines, Poster Paper in WWW '05. (Discusses text classification sensitive weighting schemes) [ WWW 2005 Link ] J.M. Liu and T.S. Chua (2001). Building semantic perceptron net for topic spotting. Proceedings of the 39th Meeting of Association of Computational Linguistics (ACL'01), Toulouse, France, pp. 370-377. (Hierarchical version of text classification using perceptrons) [ ACL Anthology Link ]
Week 10: (22 Oct)	Text Clustering and Tutorial - Categorization (directly after class) Class Slides: [ .pdf ] Readings: C. Manning, P. Raghavan and H. Schütze (2008) Chapters 16-17 "Flat Clustering" and "Hierarchical Clustering" of Introduction to Information Retrieval. Cambridge UP. [ .pdf of Chapter 16 ] [ .pdf of Chapter 17 ] [ Chapter 16 Slides (.pdf) ] [ Chapter 17 Slides (.pdf) ] *Michael Steinbach, George Karypis and Vipin Kumar (2000) A Comparison of Document Clustering Techniques. TextMining Workshop. KDD. 2000. [ Link from Vipin Kumar (.ps format) ]	Homework #1 Returned
Week 11: (29 Oct)	Sequence Labeling I and Named Entity Recognition Class Slides: [ .pdf ] Experimentation: Predicting Baltimore Weather from Ice Cream Consumption (.xls) - the Forward Backward estimation algorithm. Used in class, but which you should also experiment with on your own. The Viterbi decoding version. J. Eisner (2002). An interactive spreadsheet for teaching the forward-backward algorithm. In Proc. of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 10-18, Philadelphia, July. (Covers how to use this spreadsheet for teaching HMM's forward and backward algorithm. Go through the exercises mentioned to make sure you understand them. Readings: G. Zhou and J. Su (2002). Named Entity Recognition using an HMM-based Chunk Tagger. Proc. of 40th ACL (ACL '02). pp. 473-480. [ ACL Anthology Link ] *S Cucerzan, D. Yarowsky (1999). Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. Proc. of EMNLP 1999. [ ACL Anthology Link ] Also see the excellent post by Chris Manning and subsequent commentary: Doing Named Entity Recognition? Don't optimize for F₁.
Week 12: (5 Nov)	Sequence Labeling II and Information Extraction Class Slides: [ .pdf ] Readings: J. Lafferty, A. McCallum, F. Pereira (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. International Conf. on Machine Learning. [ Link from CiteSeer@NUS ] A. McCallum, D. Freitag, F. Pereira (2000) Maximum entropy markov models for information extraction and segmentation. International Conf. on Machine Learning. [ Link from CiteSeerX ] C. Cardie (1997). Empirical methods in information extraction. AI Magazine, 18(4): 65-79. Special Issue on Natural Language Processing. [ CiteSeer@NUS Link ] S. Soderland (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, Vol. 34(1-3), pp. 233-272. [ CiteSeer@NUS Link ] J. Xiao, T. S. Chua and J. M. Liu, A Global Rule Induction Approach to Information Extraction, ICTAI2003. [ IEEE Xplore Link ]	Homework #2 Due
Week 13: (12 Nov)	Learning To Rank and Revision Class Slides: [ .pdf ] Readings: C. Manning, P. Raghavan and H. Schütze (2008) Section 15.4 "Machine-learning methods in ad hoc information retrieval" of Introduction to Information Retrieval. Cambridge UP. (Covers pointwise and pairwise versions of the Learning to Rank (LeToR) paradigm, refer to the other papers below for specifics; warning: you'll need to have a good grasp of statistics and machine learning to read the supplemental papers, consider reading all of Chapter 15 in the text first, and possibly a good machine learning book before attempting the below papers in detail.) [ .pdf of Chapter 15 ] T.-Y. Liu (2008) Learning To Rank Tutorial Slideset. Presented at WWW 08 and SIGIR 2008. (The much more comprehensive slide set for the LeToR tutorials which compose most of the source of today's slides.) [ Link to Tie-Yan's page at MSRA (click on the .ppt logos at the bottom of the page) ] N. Fuhr (1989) Optimum polynomial retrieval functions based on the probability ranking principle, In ACM Trans. on Information Systems [ ACM Portal Link ] * W. W. Cohen, R. E. Shapire, Y. Singer (1999) Learning to order things, Journal of Artificial intelligence research (JAIR) [ Link from CSAIL, MIT ] T. Joachims (2002) Optimizing Search Engines Using Clickthrough Data, In Proc. of KDD [ Link from Joachim's Website ] R. Nallapati (2004) Discriminative models for information retrieval, In Proc of SIGIR [ Link to ACM Portal ] *Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai and H. Li (2007) Learning to Rank: From Pairwise to Listwise Approach, In Proc. of ICML. [ Link from Microsoft Research ]
Reading Week (Sat 15 Nov - Fri 21 Nov)
Final Exam (26 Nov, 7:30 SR3 (COM1 02-12))

Min-Yen Kan <kanmy@comp.nus.edu.sg> Created on: Sun Jul 13 15:47:07 SGT 2008 | RCS: $Id: syllabus.html,v 1.2 2004/08/11 06:00:38 kanmy Exp kanmy $ | Version: 1.0 | Last modified: Mon Nov 24 09:12:46 2008

CS 5246 - Text Processing and the Web

Menu