N.B.: This course is finished. I am maintaining this website for visitors' benefit. The projects below were done by the students in Semester I of 2005/06. Students were required to make a poster presentation; these are the slides that were used. Their final project submission was a paper in the form of a normal conference submission (8-10 page limit). If you have any questions about a project or would like to get hold of the final report, please email the appropriate student(s). You can also find projects from the earlier versions of this course run in Semester I of 2005/06, 2004/05 and 2003/04.
Completed projects
- Joe Woo, Metadata Harvesting: Solution to e-prints community?!
- Phan Chuong Vo, News articles classification based on URLs
- Bo Chen and Zhenzhou Zhu, LCRA: Effective Semantics for XML Keyword Search
- The Huy Nguyen, Novel Duplicate Detection Method
- Yee Fan Tan and Dave Kor, Estimating Document Frequencies of Collocations in Digital Libraries
- Patrice Choong, Anand Mani and Gang Zou, A comparison of top UK university standing in the real world to their standing in the virtual world
- Viet Bang Nguyen, Analysis of the requirement for reliable answers in web search
- Bernard Tan, Web Advertisement Extraction for Digital Libraries
- Zheng Lu and Raymond Jun Zheng, IR in FAQ System
What is the project?
Projects will be done individually or in groups of two or three.
Note that grading criteria for projects will not differ between
projects based on manpower; individuals and teams of two are often
better coordinated than teams of three, especially in short projects.
A good research project must (i) define a problem, (ii) propose a
solution, (iii) implement the solution (simulated or real) and (iv)
evaluate it against any applicable existing solutions or related work.
Your research project can take one of the following
forms:
- New research problem/solution - You define a new, interesting
problem and propose a solution. Your solution does not have to be very
good, since you are pioneering a new area of research.
- Existing research problem/new solution - You look at an
existing, interesting problem and propose a novel solution that
is better than existing solutions, which can lead to new ways of
looking at and understanding the problem. Your solution doesn't have
to outperform existing methods in all categories, but it should in at
least some particular domain. For example, we are concerned with
digital libraries in this course. It will suffice if your solution is
statistically significantly better for typical documents in digital
libraries, even if it is not in the more general case.
- Existing research problem/compare existing solutions - You look
at an existing problem and its solutions. Implement the solutions,
compare them and provide new insights into why one solution is better
than another. Provide public-domain software for letting others share
and use your work.
- Build an innovative system - Build a novel application that no
one, or few, have built before. But most importantly, identify new
issues in your system that no existing solutions can adequately
solve.
- Empirical analysis of some collected data - Researchers often
need to build systems that actually solve or improve on real problems.
Papers that analyze the usability of systems or characterize the data
in some way assist others to understand the problem or the clientele
(our users) for a particular problem.
Remember, good research always teaches other researchers something
new.
I do not expect you to write any code from scratch. In fact, if
you have an account on sf3/sunfire, you can access a host of related
software that I use in my research, in the NLP/IR software
repository. Feel free to suggest to me other resources that you
feel would be useful to have installed and available to the class.
Also, please contact me if your quota of disk space is not sufficient
for you to do the scale of research that you need.
A few highlighted resources in the NLP/IR software repository that
can help you do research for this course are:
- NUS SMS corpus - a corpus of SMS messages collected from
students here.
- NUS / Excite query logs - one day logs of (the old) Excite
search engine, and ten months of queries for the NUS LINC system
(parsed and grouped by sessions).
- WT10G - a 10 gigabyte collection of web documents, used for
standard experiments.
- WEKA / SVMlight / Boostexter / etc. - easy-to-use machine
learning utilities. Probably the easiest to use is Boostexter,
and the most complete one is WEKA. WEKA includes a number of
different machine learners so that you can do a comparative
analysis of different machine learning algorithms on your data.
- Open Directory Project RDF dump - the data structure and
content of the ODP, a Yahoo!-like repository of categorized
websites.
- WebBase statistics - document frequencies for a large portion
of tokens that appear on the web. Can be used for TF*IDF
calculations among other things.
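As a rough illustration of the TF*IDF use mentioned for the WebBase statistics, here is a minimal Python sketch. The `doc_freq` dictionary and `num_docs` count are hypothetical stand-ins for whatever the repository's document frequency table actually provides; its real file format is not described here. The add-one smoothing is one common variant, chosen only so that unseen terms do not cause a division by zero.

```python
import math
from collections import Counter


def tf_idf(tokens, doc_freq, num_docs):
    """Return {term: tf * idf} for one tokenized document.

    tokens   -- list of terms in the document
    doc_freq -- term -> number of documents containing it (hypothetical table)
    num_docs -- total number of documents the frequencies were counted over
    """
    counts = Counter(tokens)  # raw term frequency within this document
    weights = {}
    for term, tf in counts.items():
        df = doc_freq.get(term, 0)
        # Add-one smoothing: terms missing from the table get the highest idf.
        idf = math.log((num_docs + 1) / (df + 1))
        weights[term] = tf * idf
    return weights


# Toy numbers for illustration only; not taken from the WebBase data.
doc_freq = {"the": 999, "library": 9}
num_docs = 999
weights = tf_idf(["the", "digital", "library", "the"], doc_freq, num_docs)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{w:.3f}")
```

Terms that appear in nearly every document (such as "the" in the toy table) receive a weight near zero, while rarer terms are weighted up, which is the property that makes a large external frequency table useful when your own collection is small.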
Choosing a project
Below you will find a list of possible final projects. As this is
a seminar-style research course, you will be primarily assessed on the work
you do on the final project. As such I expect and demand that each
student/team of students achieve some novel research development or
finding that is not a rehashing of the existing literature. The
midterm survey paper is intended to foster this understanding and
encourage you to poke into new territories.
You are welcome and encouraged to propose alternate
projects. Your topic should blend together your strengths from your
background, experience and current coursework, yet be applicable to
digital libraries research. I have listed some ideas for projects in
certain areas. Teams that have taken projects that interest them
and/or have relevance to their research or jobs seem to always do
best. Some of the possible projects include (but are
not limited to):
- Social Network Analysis
- Building a better citation parser
- Web hyperlink classification
- Exploring the relationships between prestige, authorities and hubs
- Centrality and density of different genres of websites
- Automatic computation of an area's journal and conference reputations
- Access and Usability Issues
- Multi-object summarization
- The use of VR and immersive environments in the DL
- Efficient social network visualization
- Critique of current approaches in crosswalking of metadata
- Novel querying tools for E-mail, blogs, and IM
- Organizing photo and video content
- User modeling
- Classifying browsing and searching strategies based on
information trails
- Differences in retrieval effectiveness in speech queries
as opposed to text/typed queries
- Conceptual Search / Polysemy and synonymy
- Query expansion and restriction from user query logs
- Characterizing known item queries
- Automatic jargon and terminology canonicalization
- Classification and Filtering
- Automatic ACM classification for theses and technical reports
- Home page interest networking
- Automatic ODP categorization for web sites
- Threading and summarizing blog, email or IM searches
- Digital Library Creation
- GIS: Integration of maps at different scales
- Inferring useful metadata for genres of web documents
- Dateline and timeline history collection and canonicalization
- Digital Library Cataloging and Indexing
- Multimedia Metadata Features
- Digital Library Policy
- Exploring the integrity of skywriting/skyreading and its
effect on scholarship
- Cost models for the digital library in specialized domains/forms of media
- Convenience, user rights and usability of linkages in
the digital library
- Authorship Analysis
- Styles and Genres for authorship identification in web pages
- Linkage styles and classification for webpage creators
- Linking SMS and chat log short forms to long forms
I have listed some starting references for some of these
topics. You may find it helpful to view past projects by previous
students in earlier versions of this course run in Semester I of 2003/04, 2004/05 and 2005/06.
Project write-up, presentation and grading
Here are some slides on how to do your project proposal.
[ .pdf ] [ .htm ]
Part of the skills that you should practice in a project-based
graduate class is how to report your work. Expert researchers will
tell you that half (if not most) of your time on a project will
involve polishing your paper so it is easy to read and
straightforward. Generally, filling up the page limit is easy, but
deciding what to omit and how to succinctly express your idea is
difficult.
Your team's write-up will take the form of a research paper intended
for a conference submission with a 10-page limit. You should use an
ACM proceedings style (you can follow the instructions for WWW 2004,
for example). You may supplement this with a reference to your
project's website / blog (if one was created) and any number of
appendices that you feel will help determine a grade. Selected final
projects will be asked to submit their work to a relevant conference
or journal, such as the ones listed on the
miscellaneous page of this site.
We will not hold class during the last class session. In lieu of
class, we will meet in the evening for your project presentations on
the 20th of November, as part of this year's Graduate
Course Project Poster Session. Presentations will run for 10
minutes each with an additional five minutes for questions. Please sign
up for a time slot. Only a single group representative needs to
be present. If no one from your group can attend at the project
presentation time, please let me know in person or by email.
Your project report is due by 11:59:59pm, 20 Nov (Monday).
Standard late penalties apply, so please turn it in on time.
Grading for the project's final report and presentation is likely to
follow weights similar to those used in the previous version of this
course, with separate criteria for the presentation, for
research projects and for implementation projects.
Final Workload Disclaimer
The project is the primary means by which you will be assessed
in this course. The workload throughout the rest of the course is
purposely light to ensure that you have enough time to produce
high-quality research in the project. As such you need to budget your
team's time wisely and ensure that you have appropriately scoped your
project and covered the topic with enough detail and with appropriate
evaluation. Part-time students with other commitments need to be
particularly aware of this, as past cases have shown this problem
crops up with part-time students most often.
Some students inevitably start the project too late or mismanage
their time and neglect such open-ended courses, in order to advance in
classes that have more concrete assessment milestones. I warn you now
to budget your time between classes wisely. As this is a four MC
module, a student should allot ten hours per week to
this course. Eight of these are preparation time, and for this course
the bulk of this time is intended for your project. Roughly speaking,
you should invest about 7 weeks * 8 hours/week = 56 hours on your
project.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Thu Jun 16 09:04:02 GMT-8 2005 | Version: 1.0 | Last modified: Tue Nov 21 13:41:24 2006