Search Engine Driven Author Disambiguation

Yee Fan Tan and Min-Yen Kan
Department of Computer Science, National University of Singapore, Singapore
{tanyeefa,kanmy}@comp.nus.edu.sg

Dongwon Lee
College of Information Sciences and Technology, The Pennsylvania State University, USA
dongwon@psu.edu
Bibliographic digital libraries
Contain a large number of publication metadata records
e.g., CiteSeer, DBLP
Commonly used to measure the impact of researchers on the community
Problem
What happens when different authors share the same name?
Author disambiguation with mixed citations
Given:
An author name string X representing k unique individuals
A list of citations C containing the name X
The task:
For each citation c in C, determine which of the k individuals c belongs to
A k-way classification or clustering problem
Past work (Lee et al. 05, Han et al. 05) used internal resources:
Knowledge encoded in the records themselves
Used field similarity and common co-author strings for clustering
Problems with using only internal resources
They may provide insufficient information, or the information may be difficult to extract
e.g., two papers on the same topic whose titles use disjoint keywords
Therefore, we use resources external to the citation data
Hypothesis: Using external resources such as URLs can help disambiguate author names with mixed citations
Many factors to consider:
Which external resources to use: URLs, web page contents, affiliations, etc.
How to use them: both internal and external? How to mix them?
How to apply external resources? Weighting?
This preliminary study focuses on using URLs with a simple weighting scheme
Lay people doing this task with unfamiliar publications may use a search engine, with the paper title as the query
Our method tries to approximate this
For each citation c in C:
Query the search engine with the title of c as a phrase search to obtain a set of relevant URLs
Represent c by a feature vector over the relevant URLs, under a weighting scheme
Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters
Similarity measure: cosine similarity
Tested with single link, complete link, and group average
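The clustering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`cosine`, `single_link_hac`) and the toy hostnames are invented for the example, and single link is shown since it is one of the three linkages tested.

```python
# Sketch: citations represented as sparse hostname->weight vectors,
# clustered by single-link HAC under cosine similarity until k clusters remain.
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(h, 0.0) for h, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_link_hac(vectors, k):
    """Repeatedly merge the two most similar clusters (single link)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best, pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            # Single link: similarity of clusters = max pairwise similarity.
            sim = max(cosine(vectors[i], vectors[j])
                      for i in clusters[a] for j in clusters[b])
            if sim > best:
                best, pair = sim, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Toy example: four citations, two underlying authors (k = 2).
vecs = [
    {"www.cs.uni-a.edu": 1.0, "dblp.uni-trier.de": 0.1},
    {"www.cs.uni-a.edu": 1.0},
    {"www.ee.uni-b.edu": 1.0, "dblp.uni-trier.de": 0.1},
    {"www.ee.uni-b.edu": 1.0},
]
print(single_link_hac(vecs, k=2))  # two clusters: {0, 1} and {2, 3}
```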
Weighting: Inverse Host Frequency (IHF)
Observation
Not all URLs are equally useful
e.g., aggregator services
Desired weighting scheme
Low weights for aggregator web sites
High weights for personal and group publication pages
Inverse Host Frequency (IHF)
Similar to Inverse Document Frequency (IDF) in information retrieval
Consider the citations of the top 100 authors in DBLP (by number of citations)
For each such citation, query the search engine with its title to obtain URLs, and truncate them to their hostnames
If a hostname h has frequency f(h) among the results of the N citations, then its IHF is ihf(h) = log(N / f(h))
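A small sketch of this weighting, assuming (by direct analogy with IDF) that f(h) counts how many citations' result sets contain hostname h; the function name `ihf_weights` is illustrative, not from the paper.

```python
# Sketch: compute IHF weights from per-citation search result URL lists.
import math
from collections import Counter
from urllib.parse import urlparse

def ihf_weights(result_urls_per_citation):
    """result_urls_per_citation: one list of result URLs per citation."""
    n = len(result_urls_per_citation)
    # Truncate URLs to hostnames; count each hostname once per citation.
    host_sets = [{urlparse(u).hostname for u in urls}
                 for urls in result_urls_per_citation]
    freq = Counter(h for hosts in host_sets for h in hosts)
    return {h: math.log(n / f) for h, f in freq.items()}

results = [
    ["http://dblp.uni-trier.de/x", "http://www.cs.uni-a.edu/~a/pubs.html"],
    ["http://dblp.uni-trier.de/y", "http://www.ee.uni-b.edu/~b/papers.html"],
]
w = ihf_weights(results)
# The aggregator hostname appears in every result set, so its weight is zero.
print(w["dblp.uni-trier.de"])  # 0.0
```

Hostnames unique to one citation get the maximum weight log(N), matching the goal of favoring personal publication pages over aggregators.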
We noticed that using hostnames alone may be problematic
Especially when a host has multiple hostnames, or is represented by an IP address, with dissimilar frequency distributions
e.g., www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host
Therefore, we also experimented with:
Domains (e.g., uni-trier.de)
Resolving hostnames to IP addresses
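The three host representations might be computed as below. This is a sketch under stated assumptions: `domain` uses naive two-level suffix truncation (a real implementation would need a public-suffix list to handle names like ac.uk), and `ip_address` does a live DNS lookup, so it only works for resolvable hosts.

```python
# Sketch: the three URL truncation variants (hostname, domain, IP address).
import socket
from urllib.parse import urlparse

def hostname(url):
    """Truncate a URL to its hostname."""
    return urlparse(url).hostname or ""

def domain(url, levels=2):
    """Keep only the last `levels` labels of the hostname.
    e.g., www.informatik.uni-trier.de -> uni-trier.de (naive truncation)."""
    return ".".join(hostname(url).split(".")[-levels:])

def ip_address(url):
    """Resolve the hostname to an IP, collapsing aliases of the same host."""
    try:
        return socket.gethostbyname(hostname(url))
    except socket.gaierror:
        return hostname(url)  # fall back to the hostname if DNS fails

print(domain("http://ftp.informatik.uni-trier.de/pub"))  # uni-trier.de
```

Under this scheme, www.informatik.uni-trier.de and ftp.informatik.uni-trier.de already collapse at the domain level, while the IP variant additionally merges numeric aliases such as 136.199.54.185.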
Dataset
Manually disambiguated dataset of 24 ambiguous names in the computer science domain
Each ambiguous name represents 2 unique authors (k = 2), except for one that represents 3
Each name has 30 citations on average
The proportion of the largest class ranges from 50% to 97%
Search engine
Google (http://www.google.com/)
Single link performs best
It is good at clustering citations from different publication pages together (some pages list only selected publications)
Some authors have disparate research areas that are not well represented by a centroid vector
Resolving hostnames to IP addresses gives the best accuracy
Comparison to [Lee et al. 05, IQIS]
Apparent correlation between accuracy and the average number of URLs returned per citation
Author names with few URLs tend to fare poorly, since the results are mainly aggregator web sites
We observe no apparent relation between accuracy and the number of citations for an author name
Our algorithm scales to large numbers of citations
Analysis of the returned URLs is very fast; execution time is dominated by search engine querying
Querying may already be done during spidering, so our algorithm is time-efficient
Summary
We focused on using the URLs returned by searching for citation titles
Respectable average accuracy of 0.836 using IP addresses with single link HAC clustering
Future work
Explore other sources of information, such as the publication venues of the citations, as well as the actual contents of the web pages
Combine externally and internally gained knowledge for improved performance