Search Engine Driven Author Disambiguation
| Yee Fan Tan and Min-Yen Kan | ||
| Department of Computer Science, National University of Singapore, Singapore | ||
| {tanyeefa,kanmy}@comp.nus.edu.sg | ||
| Dongwon Lee | ||
| College of Information Sciences and Technology, The Pennsylvania State University, USA | ||
| dongwon@psu.edu | ||
| Bibliographic digital libraries | ||
| Contain a large number of publication metadata records | ||
| e.g. Citeseer, DBLP | ||
| Commonly used to measure the impact of researchers on their community | ||
| Problem | ||
| What happens when different authors share the same name? | ||
Author disambiguation with mixed citations
| Given: | ||
| An author name string X representing k unique individuals | ||
| A list of citations C containing the name X | ||
| The task: | ||
| For each citation c in C, determine which of these k individuals c belongs to | ||
| A k-way classification or clustering problem | ||
| Past work (Lee et al. 05, Han et al. 05) used internal resources: | ||
| Knowledge encoded in the records themselves | ||
| Used field similarity, common co-author strings for clustering | ||
| Problems with using only internal resources | ||
| May provide insufficient information, or information that is difficult to extract | ||
| e.g., two papers on the same topic using disjoint keywords in their titles | ||
| Therefore, we use resources external to the citation data | ||
| Hypothesis: Using external resources such as URLs would help disambiguate author names with mixed citations | ||
| Many factors to consider: | ||
| Which external resources to use: URL, web page contents, affiliation, etc. | ||
| How to use: both internal and external? How to mix? | ||
| How to apply external resources? Weighting? | ||
| This preliminary study focuses on using URLs with a simple weighting scheme | ||
| A layperson doing this task with unfamiliar publications might use a search engine, with the paper title as the query | ||
| Our method tries to approximate this | ||
| For each citation c in C | ||
| Query search engine with title of c as phrase search to obtain a set of relevant URLs | ||
| Represent c by a feature vector of its relevant URLs, with values set by a weighting scheme | ||
| Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters | ||
| Cosine similarity | ||
| Tested with single link, complete link and group average | ||
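The steps above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the sparse-dictionary vector representation, the function names `cosine` and `hac`, and the example hostnames are all assumptions, and "group average" is read here as the mean similarity over cross-cluster pairs.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def hac(vectors, k, linkage="single"):
    """Repeatedly merge the two most similar clusters until k remain."""
    n = len(vectors)
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]

    def cluster_sim(a, b):
        sims = [sim[(min(i, j), max(i, j))] for i in a for j in b]
        if linkage == "single":
            return max(sims)          # most similar cross-cluster pair
        if linkage == "complete":
            return min(sims)          # least similar cross-cluster pair
        return sum(sims) / len(sims)  # average over cross-cluster pairs

    while len(clusters) > k:
        # combinations yields index pairs with a < b, so deleting b is safe
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical citations: each maps a host to its weight.
citations = [
    {"www.cs.example.edu": 1.0, "dblp.example.org": 0.1},
    {"www.cs.example.edu": 1.0},
    {"lab.example.com": 1.0, "dblp.example.org": 0.1},
    {"lab.example.com": 1.0},
]
clusters = hac(citations, k=2)
```

In this toy input the low-weight shared host barely affects the similarities, so the citations group by their high-weight hosts.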
Weighting: Inverse Host Frequency (IHF)
| Observation | ||
| Not all URLs are equally useful | ||
| e.g., aggregator services | ||
| Desired weighting scheme | ||
| Low weights to aggregator web sites | ||
| High weights to personal and group publication pages | ||
| Inverse Host Frequency (IHF) | ||
| Similar to Inverse Document Frequency (IDF) in information retrieval | ||
| Consider citations of top 100 authors in DBLP (by number of citations) | ||
| For each such citation, query the search engine with its title to obtain URLs, then truncate them to their hostnames | ||
| If a hostname h has frequency f(h), then its IHF is | ||
Weighting: Inverse Host Frequency (IHF)
| We notice that using hostnames alone may be problematic | ||
| Especially when a host has multiple hostnames, or is also referenced by its IP address, and these have dissimilar frequency distributions | ||
| e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host | ||
| Therefore, we also experimented with | ||
| Domain (e.g. uni-trier.de) | ||
| Resolving hostnames to IP addresses | ||
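A rough sketch of how host keys and IHF weights might be computed follows. The IHF formula itself is not given in this text, so the IDF-style form log(N / f(h)), with N the total number of URLs counted, is an assumption made by analogy with IDF. The function names, the URL paths, and the naive "last two labels" domain rule are also illustrative assumptions; the rule is wrong for suffixes such as .co.uk.

```python
import ipaddress
import math
import socket
from collections import Counter
from urllib.parse import urlsplit

def host_key(url, mode="hostname"):
    """Reduce a URL to a hostname, a domain, or a resolved IP address."""
    hostname = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(hostname)
        return hostname               # already an IP address; keep as-is
    except ValueError:
        pass
    if mode == "domain":
        # Naive rule (assumption): keep only the last two labels.
        return ".".join(hostname.split(".")[-2:])
    if mode == "ip":
        # Needs network access; raises socket.gaierror if resolution fails.
        return socket.gethostbyname(hostname)
    return hostname

def ihf_weights(urls, mode="hostname"):
    """Assumed IDF-style weight log(N / f(h)) for each host key h."""
    freq = Counter(host_key(u, mode) for u in urls)
    total = sum(freq.values())
    return {h: math.log(total / f) for h, f in freq.items()}

# Hostnames are taken from the example above; the paths are made up.
urls = [
    "http://www.informatik.uni-trier.de/a",
    "http://ftp.informatik.uni-trier.de/b",
    "http://www.informatik.uni-trier.de/c",
    "http://people.example.edu/~author/pubs.html",
]
by_hostname = ihf_weights(urls, mode="hostname")
by_domain = ihf_weights(urls, mode="domain")
```

With hostnames as keys, the two uni-trier.de hostnames are counted separately. With domains as keys their counts are pooled, so the frequent host receives a lower weight than the personal page.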
| Dataset | ||
| Manually disambiguated dataset of 24 ambiguous names in the computer science domain | ||
| Each ambiguous name represents 2 unique authors (k = 2), except for one that represents 3 | ||
| Each name has 30 citations attributed to it on average | ||
| Proportion of largest class ranges from 50% to 97% | ||
| Search engine | ||
| Google (http://www.google.com/) | ||
| Single link performs best | ||
| Good for clustering citations from different publication pages together (some pages list only selected publications) | ||
| Some authors have disparate research areas, not well represented by a centroid vector | ||
| Resolving hostnames to IP addresses gives the best accuracy | ||
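These slides do not state how accuracy is computed. One common choice for scoring a clustering against gold labels, assumed here purely for illustration, is the best accuracy over all one-to-one mappings of cluster ids to authors. The function name and the toy labels below are hypothetical, and the enumeration of permutations is practical only for small k such as the k = 2 and k = 3 cases in this dataset.

```python
from itertools import permutations

def clustering_accuracy(predicted, gold, k):
    """Best fraction correct over all mappings of cluster ids to authors.

    Both label lists are assumed to use integer ids in range(k).
    """
    best = 0
    for perm in permutations(range(k)):
        # perm[p] is the author id that cluster p is mapped to
        correct = sum(1 for p, g in zip(predicted, gold) if perm[p] == g)
        best = max(best, correct)
    return best / len(gold)

# Toy example: five citations, two authors.
predicted = [0, 0, 1, 1, 1]
gold = [1, 1, 0, 0, 1]
accuracy = clustering_accuracy(predicted, gold, k=2)  # 4 of 5 correct: 0.8
```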
Comparison to [Lee et al. 05, IQIS]
| Apparent correlation between accuracy and average number of URLs returned per citation | ||
| Author names with few URLs tend to fare poorly since results are mainly aggregator web sites | ||
| We do not observe any apparent relation between accuracy and number of citations for an author name | ||
| Our algorithm scales to a large number of citations | ||
| Analysis of returned URLs is very fast; execution time is dominated by search engine querying | ||
| Querying may already be done while spidering, so our algorithm is time-efficient | ||
| Summary | ||
| We focused on using URLs returned from searching citation titles | ||
| Respectable average accuracy of 0.836 using IP addresses with single-link HAC | ||
| Future work | ||
| Explore other sources of information, such as the publication venues of the citations and the actual contents of the web pages | ||
| Combine knowledge gained externally and internally to obtain improved performance | ||