Search Engine Driven
Author Disambiguation
Yee Fan Tan and Min-Yen Kan
Department of Computer Science
National University of Singapore, Singapore
{tanyeefa,kanmy}@comp.nus.edu.sg

Dongwon Lee
College of Information Sciences and Technology
The Pennsylvania State University, USA
dongwon@psu.edu

Introduction
Bibliographic digital libraries
Contain large numbers of publication metadata records
e.g. CiteSeer, DBLP
Commonly used to measure the impact of researchers on the community
Problem
What happens when different authors share the same name?

Motivation @ DBLP


Author disambiguation with mixed citations
Given:
An author name string X representing k unique individuals
A list of citations C containing the name X
The task:
For each citation c in C, determine which of these k individuals c belongs to
A k-way classification or clustering problem (illustrated in the sketch below)
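As a rough illustration (the name, titles, and co-authors below are hypothetical), the input and the desired output of the task look like this:

    # Hypothetical example: the name "J. Smith" appears in several citations,
    # and each citation must be assigned to one of k distinct individuals.
    citations = [
        {"title": "Scalable query processing in sensor networks", "coauthors": ["A. Lee"]},
        {"title": "Learning to rank web documents", "coauthors": ["B. Chen"]},
        {"title": "In-network aggregation for sensor queries", "coauthors": ["A. Lee"]},
    ]
    k = 2  # number of unique individuals sharing the name "J. Smith"

    # Desired output: one label in {0, ..., k-1} per citation; here [0, 1, 0]
    # says the first and third citations belong to the same person.
    labels = [0, 1, 0]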

Internal Resources
Past work (Lee et al. 05, Han et al. 05) used internal resources:
Knowledge encoded in the records themselves
Used field similarity and common co-author strings for clustering
Problems with using only internal resources
May provide insufficient information or be difficult to extract
e.g., two papers on the same topic using disjoint keywords in their titles
Therefore, we use resources external to the citation data

Research Hypothesis
Hypothesis: Using external resources such as URLs would help disambiguate author names with mixed citations
Many factors to consider:
Which external resources to use: URLs, web page contents, affiliations, etc.
How to use them: both internal and external? How to mix them?
How to apply external resources? What weighting?
This preliminary study focuses on using URLs with a simple weighting scheme

External Resources
Lay people doing this task with unfamiliar publications may use a search engine, issuing the paper title as the query
Our method tries to approximate this
For each citation c in C:
Query the search engine with the title of c as a phrase search to obtain a set of relevant URLs
Represent c by a feature vector over the returned URLs, using a weighting scheme
Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters
Cosine similarity between feature vectors
Tested with single link, complete link and group average (a code sketch follows below)
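A minimal sketch of this pipeline in Python, assuming a query_search_engine(title) helper that performs the phrase search and returns the result URLs; the helper, the binary weighting, and the requirement that every citation returns at least one URL are assumptions of the sketch:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_citations(citations, k, query_search_engine):
        # Collect the URLs returned for each citation's title.
        url_sets = [set(query_search_engine(c["title"])) for c in citations]

        # Build feature vectors over the vocabulary of all returned URLs
        # (binary here; IHF weights could replace the 1.0 below).
        vocab = sorted(set().union(*url_sets))
        index = {url: i for i, url in enumerate(vocab)}
        X = np.zeros((len(citations), len(vocab)))
        for row, urls in enumerate(url_sets):
            for url in urls:
                X[row, index[url]] = 1.0

        # HAC with cosine distance; single link performed best in our experiments.
        Z = linkage(X, method="single", metric="cosine")
        return fcluster(Z, t=k, criterion="maxclust")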

Weighting: Inverse Host Frequency (IHF)
Observation
Not all URLs are equally useful
e.g., aggregator services
Desired weighting scheme
Low weights to aggregator web sites
High weights to personal and group publication pages
Inverse Host Frequency (IHF)
Similar to Inverse Document Frequency (IDF) in information retrieval
Consider the citations of the top 100 authors in DBLP (by number of citations)
For each such citation, query the search engine with its title to obtain URLs, then truncate them to their hostnames
If a hostname h has frequency f(h), then its IHF is IHF(h) = log(N / f(h)), where N is the total number of hostname occurrences collected (by analogy with IDF)
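A minimal sketch of this IHF computation (the log(N / f(h)) form above and the shape of the input are assumptions based on the IDF analogy):

    import math
    from collections import Counter
    from urllib.parse import urlparse

    def compute_ihf(url_lists):
        # url_lists: one list of result URLs per training citation
        # (e.g. the citations of the top 100 DBLP authors).
        host_counts = Counter(urlparse(u).hostname for urls in url_lists for u in urls)
        total = sum(host_counts.values())

        # Frequent (aggregator) hosts get low weights; rare hosts such as
        # personal or group publication pages get high weights.
        return {h: math.log(total / f) for h, f in host_counts.items()}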

Weighting: Inverse Host Frequency (IHF)
We notice that using hostnames alone may be problematic
Especially when one host has multiple hostnames or is reached via an IP address, so the same host is counted under several names with dissimilar frequency distributions
e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host
Therefore, we also experimented with two normalizations (sketched in code below):
Domain (e.g. uni-trier.de)
Resolving hostnames to IP addresses
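A sketch of the two normalization variants (the last-two-labels domain heuristic is a simplification; a public-suffix list would be more accurate):

    import socket
    from urllib.parse import urlparse

    def to_domain(url):
        # Keep only the registered domain,
        # e.g. www.informatik.uni-trier.de -> uni-trier.de (simplified heuristic).
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else host

    def to_ip(url):
        # Resolve the hostname to an IP address so aliases of the same host collapse.
        host = urlparse(url).hostname or ""
        try:
            return socket.gethostbyname(host)
        except socket.gaierror:
            return host  # fall back to the hostname if resolution fails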

Evaluation
Dataset
Manually-disambiguated dataset of 24 ambiguous names in the computer science domain
Each ambiguous name represented 2 unique authors (k = 2), except for one that represented 3
Each name is associated with 30 citations on average
The proportion of the largest class ranges from 50% to 97%
Search engine
Google (http://www.google.com/)

Evaluation
Single link performs best
Good for clustering citations from different publication pages together (some pages list only selected publications)
Some authors have disparate research areas, not well represented by a centroid vector
Resolving hostnames to IP addresses gives the best accuracy

Comparison to [Lee et al. 05, IQIS]

Discussion
Apparent correlation between accuracy and average number of URLs returned per citation
Author names with few URLs tend to fare poorly since results are mainly aggregator web sites
We do not observe any apparent relation between accuracy and number of citations for an author name
Our algorithm scales to large numbers of citations
Analysis of the returned URLs is very fast; execution time is dominated by search engine querying
Querying may already be done while spidering, so our algorithm is time-efficient

Conclusion
Summary
We focused on using URLs returned from searching citation titles
Respectable average accuracy of 0.836 using IP addresses with single link HAC clustering
Future work
Explore other sources of information, such as the publication venues of the citations as well as utilizing the actual contents of the web pages
Combine knowledge gained externally and internally to obtain improved performance