1
|
- Yee Fan Tan and Min-Yen Kan
- Department of Computer Science
National University of Singapore, Singapore
- {tanyeefa,kanmy}@comp.nus.edu.sg
Dongwon Lee
- College of Information Sciences and Technology
The Pennsylvania State University, USA
- dongwon@psu.edu
|
2
|
- Bibliographic digital libraries
- Contain a large number of publication metadata records
- e.g. Citeseer, DBLP
- Commonly used to measure researchers' impact on the community
- Problem
- What happens when different authors share the same name?
|
3
|
|
4
|
|
5
|
- Given:
- An author name string X representing k unique individuals
- A list of citations C containing the name X
- The task:
- For each citation c in C, determine which of these k individuals c
belongs to
- A k-way classification or clustering problem
|
6
|
- Past work (Lee et al. 05, Han et al. 05) used internal resources:
- Knowledge encoded in the records themselves
- Used field similarity, common co-author strings for clustering
- Problems with using only internal resources
- May provide insufficient information, or information that is difficult to
extract
- e.g., two papers on the same topic using disjoint keywords in their
titles
- Therefore, we use resources external to the citation data
|
7
|
- Hypothesis: Using external resources such as URLs would help disambiguate
author names with mixed citations
- Many factors to consider:
- Which external resources to use: URLs, web page contents, affiliations, etc.
- How to use them: both internal and external? How to mix them?
- How to apply external resources? How to weight them?
- This preliminary study focuses on the case of URLs with simple weighting
|
8
|
- Lay people doing this task with unfamiliar publications may use a search
engine, using paper title as query
- Our method tries to approximate this
- For each citation c in C
- Query search engine with title of c as phrase search to obtain a set of
relevant URLs
- Represent c as a feature vector over its relevant URLs, weighted by a weighting scheme
- Apply hierarchical agglomerative clustering (HAC) on C to derive k
clusters
- Cosine similarity
- Tested with single link, complete link and group average
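
The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each citation is already represented as a sparse vector mapping a host to its weight, and the function names (`cosine`, `single_link_hac`), example hosts, and weights are hypothetical. Only the single-link variant is shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_link_hac(vectors, k):
    """Merge the two most similar clusters until only k clusters remain.

    Single link: the similarity of two clusters is the similarity of
    their most similar pair of members.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            sim = max(
                cosine(vectors[i], vectors[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or sim > best[0]:
                best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical data: four citations, two authors, one shared aggregator host.
citations = [
    {"alice.example.edu": 1.0, "aggregator.example.com": 0.1},
    {"alice.example.edu": 1.0},
    {"bob.example.org": 1.0, "aggregator.example.com": 0.1},
    {"bob.example.org": 1.0},
]
print(single_link_hac(citations, k=2))
```

Because the aggregator host carries a low weight, it contributes little to the similarity between citations 0 and 2, so the two authors' citations end up in separate clusters.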
|
9
|
- Observation
- Not all URLs are equally useful
- e.g., aggregator services
- Desired weighting scheme
- Low weights to aggregator web sites
- High weights to personal and group publication pages
- Inverse Host Frequency (IHF)
- Similar to Inverse Document Frequency (IDF) in information retrieval
- Consider citations of top 100 authors in DBLP (by number of citations)
- For each such citation, query search engine with its title to obtain
URLs, truncate them to their hostnames
- If a hostname h has frequency f(h), then its IHF is
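
As a rough illustration of IDF-style host weighting, the sketch below assumes the form IHF(h) = log(N / f(h)), where N is the number of citations queried and f(h) is the number of citations whose search results include host h. This form, the per-citation counting, and the example URLs are assumptions made for illustration, and may differ from the definition used in this work.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def host_frequencies(urls_per_citation):
    """Count, per hostname, how many citations returned at least one URL on it."""
    counts = Counter()
    for urls in urls_per_citation:
        hosts = {urlsplit(url).hostname for url in urls}
        hosts.discard(None)  # skip URLs without a parseable hostname
        counts.update(hosts)
    return counts


def ihf(counts, n_citations):
    """Assumed IDF-style weight: log(N / f(h)). Frequent hosts get low weight."""
    return {host: math.log(n_citations / f) for host, f in counts.items()}


# Hypothetical search results for three citations.
results = [
    ["http://aggregator.example.com/p1", "http://alice.example.edu/pubs.html"],
    ["http://aggregator.example.com/p2"],
    ["http://aggregator.example.com/p3", "http://bob.example.org/papers/"],
]
weights = ihf(host_frequencies(results), n_citations=len(results))
for host, weight in sorted(weights.items()):
    print(host, round(weight, 3))
```

Under this assumed form, a host that appears for every citation (an aggregator) receives weight 0, while a host that appears for only one citation receives the highest weight.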
|
10
|
- We notice that using hostnames alone may be problematic
- Especially when a host has multiple hostnames or is represented by an
IP address with dissimilar distributions
- e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and
136.199.54.185 are the same host
- Therefore, we also experimented with
- Domain (e.g. uni-trier.de)
- Resolving hostnames to IP addresses
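
The three granularities (hostname, domain, IP address) can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, `domain_of` naively keeps the last two labels (which is wrong for suffixes such as `co.uk`), and `ip_of` accepts an injectable resolver so that the example runs without network access. The offline mapping reuses the host example given above rather than a live DNS lookup.

```python
import socket
from urllib.parse import urlsplit


def hostname_of(url):
    """Truncate a URL to its hostname (lower-cased by urlsplit)."""
    return urlsplit(url).hostname


def domain_of(hostname, labels=2):
    """Naive domain: keep the last `labels` dot-separated labels.

    Assumption for illustration only; it ignores multi-label public
    suffixes. Dotted-decimal IP addresses are returned unchanged.
    """
    parts = hostname.split(".")
    if all(part.isdigit() for part in parts):
        return hostname
    return ".".join(parts[-labels:])


def ip_of(hostname, resolver=socket.gethostbyname):
    """Resolve a hostname to an IP address, falling back to the hostname."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return hostname


# Offline stand-in for DNS, using the example hosts from the text.
known = {
    "www.informatik.uni-trier.de": "136.199.54.185",
    "ftp.informatik.uni-trier.de": "136.199.54.185",
}
offline_resolver = known.__getitem__  # raises KeyError for unknown hosts

host = hostname_of("http://www.informatik.uni-trier.de/~ley/db/")
print(host)
print(domain_of(host))
print(ip_of(host, resolver=offline_resolver))
```

With IP resolution, the `www` and `ftp` hostnames collapse into a single feature, which is the effect the hostname-only representation misses.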
|
11
|
- Dataset
- Manually-disambiguated dataset of 24 ambiguous names in computer
science domain
- Each ambiguous name represents 2 unique authors (k = 2), except for one
name that represents 3
- Each name has 30 citations on average
- Proportion of largest class ranges from 50% to 97%
- Search engine
- Google (http://www.google.com/)
|
12
|
- Single link performs best
- Good for clustering citations from different publication pages together
(some pages list only selected publications)
- Some authors have disparate research areas, not well represented by a
centroid vector
- Resolving hostnames to IP addresses gives the best accuracy
|
13
|
|
14
|
|
15
|
- Apparent correlation between accuracy and average number of URLs
returned per citation
- Author names with few URLs tend to fare poorly since results are mainly
aggregator web sites
- We do not observe any apparent relation between accuracy and number of
citations for an author name
- Our algorithm scales to a large number of citations
- Analysis of returned URLs is very fast; execution time is dominated by
search engine querying
- Querying may already be done while spidering, so our algorithm is
time-efficient
|
16
|
- Summary
- We focused on using URLs returned from searching citation titles
- Respectable average accuracy of 0.836 using IP addresses with single
link HAC clustering
- Future work
- Explore other sources of information, such as the publication venues of
the citations as well as utilizing the actual contents of the web pages
- Combine knowledge gained externally and internally to obtain improved
performance
|