1	Search Engine Driven Author Disambiguation
Yee Fan Tan and Min-Yen Kan
Department of Computer Science, National University of Singapore, Singapore
{tanyeefa,kanmy}@comp.nus.edu.sg
Dongwon Lee
College of Information Sciences and Technology, The Pennsylvania State University, USA
dongwon@psu.edu
2	Introduction
- Bibliographic digital libraries
  - Contain a large number of publication metadata records, e.g. CiteSeer, DBLP
  - Commonly used to measure the impact of researchers on the community
- Problem
  - What happens when different authors share the same name?
3	Motivation @ DBLP
5	Author disambiguation with mixed citations
- Given:
  - An author name string X representing k unique individuals
  - A list of citations C containing the name X
- The task:
  - For each citation c in C, determine which of these k individuals c belongs to
  - A k-way classification or clustering problem
6	Internal Resources
- Past work (Lee et al. 05, Han et al. 05) used internal resources:
  - Knowledge encoded in the records themselves
  - Used field similarity and common co-author strings for clustering
- Problems with using only internal resources
  - May provide insufficient information, or the information may be difficult to extract
  - e.g., two papers on the same topic using disjoint keywords in their titles
- Therefore, we use resources external to the citation data
7	Research Hypothesis
- Hypothesis: using external resources such as URLs would help disambiguate author names with mixed citations
- Many factors to consider:
  - Which external resources to use: URLs, web page contents, affiliations, etc.
  - How to use them: both internal and external? How to mix them?
  - How to apply external resources? Weighting?
- This preliminary study focuses on the case of using URLs with simple weighting
8	External Resources
- Lay people doing this task with unfamiliar publications may use a search engine, using the paper title as the query
- Our method tries to approximate this
  - For each citation c in C
    - Query a search engine with the title of c as a phrase search to obtain a set of relevant URLs
    - Represent c by a feature vector of relevant URLs, weighted by a weighting scheme
  - Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters
    - Cosine similarity
    - Tested with single link, complete link and group average
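The clustering step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: each citation is assumed to be a sparse dict mapping a URL feature to a weight, the hostnames and weights in the example are invented, and "group average" is taken here to mean the average over cross-cluster pairs, which is one common definition.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hac(vectors, k, linkage="single"):
    """Merge the most similar pair of clusters until k clusters remain.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    n = len(vectors)
    # Precompute all pairwise citation similarities once.
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i, j in combinations(range(n), 2)}

    def pair_sim(i, j):
        return sim[(i, j)] if i < j else sim[(j, i)]

    def cluster_sim(a, b):
        sims = [pair_sim(i, j) for i in a for j in b]
        if linkage == "single":      # most similar cross-cluster pair
            return max(sims)
        if linkage == "complete":    # least similar cross-cluster pair
            return min(sims)
        if linkage == "average":     # mean over cross-cluster pairs
            return sum(sims) / len(sims)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Invented example: two citations per author, plus a shared aggregator host.
citations = [
    {"a.example.edu": 1.0, "agg.example.com": 0.2},
    {"a.example.edu": 1.0},
    {"b.example.org": 1.0, "agg.example.com": 0.2},
    {"b.example.org": 1.0},
]
clusters = hac(citations, k=2, linkage="single")
```

The quadratic pairwise table is adequate for the tens of citations per name considered here; a library routine would be the better choice for much larger inputs.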
9	Weighting: Inverse Host Frequency (IHF)
- Observation
  - Not all URLs are equally useful, e.g., aggregator services
- Desired weighting scheme
  - Low weights for aggregator web sites
  - High weights for personal and group publication pages
- Inverse Host Frequency (IHF)
  - Similar to Inverse Document Frequency (IDF) in information retrieval
  - Consider the citations of the top 100 authors in DBLP (by number of citations)
  - For each such citation, query the search engine with its title to obtain URLs, and truncate them to their hostnames
  - If a hostname h has frequency f(h), then its IHF is
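The exact IHF formula is not reproduced in this text. As a minimal sketch, assume the standard IDF form, ihf(h) = log(N / f(h)), where N is the number of sampled citations and f(h) is the number of those citations whose search results include hostname h. Both the formula and the way f(h) is counted are assumptions made for illustration, and the URLs below are invented.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def hostname(url):
    """Truncate a URL to its (lowercased) hostname; '' if it has none."""
    return urlsplit(url).hostname or ""


def host_frequencies(url_lists):
    """For each hostname, count the citations whose results include it."""
    counts = Counter()
    for urls in url_lists:
        hosts = {hostname(u) for u in urls}
        hosts.discard("")
        counts.update(hosts)  # each citation counts a host at most once
    return counts


def ihf(host, counts, n_citations):
    """Assumed IDF-style weight: log(N / f(h)).

    Hosts never seen in the sample are treated as f(h) = 1 (an assumption),
    which gives them the maximum weight log(N).
    """
    return math.log(n_citations / max(counts.get(host, 0), 1))


def citation_vector(urls, counts, n_citations):
    """Represent one citation as a sparse {hostname: IHF weight} vector."""
    return {h: ihf(h, counts, n_citations)
            for h in {hostname(u) for u in urls} if h}


# Invented search results for three sampled citations.
sample = [
    ["http://agg.example.com/p1", "http://lab.example.edu/pubs.html"],
    ["http://agg.example.com/p2"],
    ["http://agg.example.com/p3", "http://home.example.org/~b/"],
]
counts = host_frequencies(sample)
vector = citation_vector(sample[0], counts, len(sample))
```

Under this assumed form, the aggregator host that appears for every citation gets weight 0, while a host seen for only one citation gets the largest weight, which matches the desired behaviour stated on the slide.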
10	Weighting: Inverse Host Frequency (IHF)
- We notice that using hostnames alone may be problematic
  - Especially when a host has multiple hostnames or is represented by an IP address, with dissimilar distributions
  - e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host
- Therefore, we also experimented with
  - Domains (e.g. uni-trier.de)
  - Resolving hostnames to IP addresses
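The two alternative feature keys can be sketched as follows. The domain rule here (keep the last two labels) is a deliberate simplification and an assumption: it yields uni-trier.de for the slide's example but would be wrong for suffixes such as co.uk. The stub resolver encodes only the mapping stated on the slide so that the example runs without network access; a real run would use the default `socket.gethostbyname`.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_ip(host):
    """True if `host` is an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def host_key(url):
    """Feature key 1: the URL's hostname."""
    return urlsplit(url).hostname or ""


def domain_key(host, labels=2):
    """Feature key 2: naive domain, i.e. the last `labels` dot-separated labels.

    IP address literals are returned unchanged.
    """
    if is_ip(host):
        return host
    return ".".join(host.split(".")[-labels:])


def ip_key(host, resolve=socket.gethostbyname):
    """Feature key 3: the host's IP address; falls back to the hostname
    itself when resolution fails."""
    if is_ip(host):
        return host
    try:
        return resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return host


# Mapping taken from the slide's example, not from a live DNS lookup.
SLIDE_EXAMPLE = {
    "www.informatik.uni-trier.de": "136.199.54.185",
    "ftp.informatik.uni-trier.de": "136.199.54.185",
}


def stub_resolve(host):
    """Hypothetical offline resolver used only for this illustration."""
    if host not in SLIDE_EXAMPLE:
        raise OSError(f"cannot resolve {host}")
    return SLIDE_EXAMPLE[host]


hosts = ["www.informatik.uni-trier.de",
         "ftp.informatik.uni-trier.de",
         "136.199.54.185"]
domains = [domain_key(h) for h in hosts]
ips = [ip_key(h, resolve=stub_resolve) for h in hosts]
```

With these keys, the three spellings of the same host collapse to a single feature under IP resolution, whereas the domain key merges only the two hostnames and leaves the raw IP address separate.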
11	Evaluation
- Dataset
  - Manually disambiguated dataset of 24 ambiguous names in the computer science domain
  - Each ambiguous name represents 2 unique authors (k = 2), except for one that represents 3
  - Each name is attributed to 30 citations on average
  - Proportion of the largest class ranges from 50% to 97%
- Search engine
  - Google (http://www.google.com/)
12	Evaluation
- Single link performs best
  - Good for clustering citations from different publication pages together (some pages list only selected publications)
  - Some authors have disparate research areas, which are not well represented by a centroid vector
- Resolving hostnames to IP addresses gives the best accuracy
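The slides do not say how accuracy is computed for a clustering. One common choice, assumed here, is to try every one-to-one assignment of clusters to true authors and report the best fraction of correctly placed citations; this is feasible because k is only 2 or 3. The majority-class baseline corresponds to the "proportion of the largest class" figure on the dataset slide. The labels in the example are invented.

```python
from collections import Counter
from itertools import permutations

_UNMATCHED = object()  # placeholder for a cluster mapped to no author


def clustering_accuracy(predicted, gold):
    """Best accuracy over one-to-one mappings of cluster ids to authors."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    clusters = list(dict.fromkeys(predicted))  # unique ids, order preserved
    authors = list(dict.fromkeys(gold))
    # Pad so that every cluster can be assigned something.
    size = max(len(clusters), len(authors))
    padded = authors + [_UNMATCHED] * (size - len(authors))
    best = 0
    for assignment in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(1 for p, g in zip(predicted, gold) if mapping[p] == g)
        best = max(best, correct)
    return best / len(gold)


def majority_baseline(gold):
    """Accuracy of putting every citation into the largest class."""
    return max(Counter(gold).values()) / len(gold)


# Invented example: five citations, two authors, one citation misplaced.
predicted = [0, 0, 1, 1, 1]
gold = ["a", "a", "b", "b", "a"]
accuracy = clustering_accuracy(predicted, gold)
baseline = majority_baseline(gold)
```

Comparing the two numbers per name shows whether the clustering beats the trivial strategy of assigning every citation to one author, which matters for names whose largest class is already near 97%.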
13	Comparison to [Lee et al. 05, IQIS]
14	Discussion
15	Discussion
- Apparent correlation between accuracy and the average number of URLs returned per citation
  - Author names with few URLs tend to fare poorly, since the results are mainly aggregator web sites
- We do not observe any apparent relation between accuracy and the number of citations for an author name
- Our algorithm scales to large numbers of citations
  - Analysis of the returned URLs is very fast; execution time is dominated by search engine querying
  - Querying may already be done while spidering, so our algorithm is time-efficient
16	Conclusion
- Summary
  - We focused on using URLs returned from searching citation titles
  - Respectable average accuracy of 0.836 using IP addresses with single-link HAC
- Future work
  - Explore other sources of information, such as the publication venues of the citations, as well as the actual contents of the web pages
  - Combine knowledge gained externally and internally to obtain improved performance