Search Engine Driven
Author Disambiguation
Yee Fan Tan and Min-Yen Kan
Department of Computer Science
National University of Singapore, Singapore
{tanyeefa,kanmy}@comp.nus.edu.sg

Dongwon Lee
College of Information Sciences and Technology
The Pennsylvania State University, USA
dongwon@psu.edu

Introduction
Bibliographic digital libraries
Contain large numbers of publication metadata records
e.g. CiteSeer, DBLP
Commonly used to measure the impact of researchers on the community
Problem
What happens when different authors share the same name?

Motivation @ DBLP


Author disambiguation with mixed citations
Given:
An author name string X representing k unique individuals
A list of citations C containing the name X
The task:
For each citation c in C, determine which of these k individuals c belongs to
A k-way classification or clustering problem (illustrated in the sketch below)
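As a rough illustration (the name, titles, and co-authors below are hypothetical), the input and the desired output of the task look like this:

    # Hypothetical example: the name "J. Smith" appears in several citations,
    # and each citation must be assigned to one of k distinct individuals.
    citations = [
        {"title": "Scalable query processing in sensor networks", "coauthors": ["A. Lee"]},
        {"title": "Learning to rank web documents", "coauthors": ["B. Chen"]},
        {"title": "In-network aggregation for sensor queries", "coauthors": ["A. Lee"]},
    ]
    k = 2  # number of unique individuals sharing the name "J. Smith"

    # Desired output: one label in {0, ..., k-1} per citation; here [0, 1, 0]
    # says the first and third citations belong to the same person.
    labels = [0, 1, 0]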

Internal Resources
Past work (Lee et al. 05, Han et al. 05) used internal resources:
Knowledge encoded in the records themselves
Used field similarity and common co-author strings for clustering
Problems with using only internal resources
May provide insufficient information or be difficult to extract
e.g., two papers on the same topic using disjoint keywords in their titles
Therefore, we use resources external to the citation data

Research Hypothesis
Hypothesis: Using external resources such as URLs would help disambiguate author names with mixed citations
Many factors to consider:
Which external resources to use: URLs, web page contents, affiliations, etc.
How to use them: both internal and external? How to mix them?
How to apply external resources? What weighting?
This preliminary study focuses on using URLs with a simple weighting scheme

External Resources
Lay people doing this task with unfamiliar publications may use a search engine, issuing the paper title as the query
Our method tries to approximate this
For each citation c in C:
Query the search engine with the title of c as a phrase search to obtain a set of relevant URLs
Represent c by a feature vector over the returned URLs, using a weighting scheme
Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters
Cosine similarity between feature vectors
Tested with single link, complete link and group average (a code sketch follows below)
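A minimal sketch of this pipeline in Python, assuming a query_search_engine(title) helper that performs the phrase search and returns the result URLs; the helper, the binary weighting, and the requirement that every citation returns at least one URL are assumptions of the sketch:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_citations(citations, k, query_search_engine):
        # Collect the URLs returned for each citation's title.
        url_sets = [set(query_search_engine(c["title"])) for c in citations]

        # Build feature vectors over the vocabulary of all returned URLs
        # (binary here; IHF weights could replace the 1.0 below).
        vocab = sorted(set().union(*url_sets))
        index = {url: i for i, url in enumerate(vocab)}
        X = np.zeros((len(citations), len(vocab)))
        for row, urls in enumerate(url_sets):
            for url in urls:
                X[row, index[url]] = 1.0

        # HAC with cosine distance; single link performed best in our experiments.
        Z = linkage(X, method="single", metric="cosine")
        return fcluster(Z, t=k, criterion="maxclust")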

Weighting: Inverse Host Frequency (IHF)
Observation
Not all URLs are equally useful
e.g., aggregator services
Desired weighting scheme
Low weights to aggregator web sites
High weights to personal and group publication pages
Inverse Host Frequency (IHF)
Similar to Inverse Document Frequency (IDF) in information retrieval
Consider the citations of the top 100 authors in DBLP (by number of citations)
For each such citation, query the search engine with its title to obtain URLs, then truncate them to their hostnames
If a hostname h has frequency f(h), then its IHF is IHF(h) = log(N / f(h)), where N is the total number of hostname occurrences collected (by analogy with IDF)
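A minimal sketch of this IHF computation (the log(N / f(h)) form above and the shape of the input are assumptions based on the IDF analogy):

    import math
    from collections import Counter
    from urllib.parse import urlparse

    def compute_ihf(url_lists):
        # url_lists: one list of result URLs per training citation
        # (e.g. the citations of the top 100 DBLP authors).
        host_counts = Counter(urlparse(u).hostname for urls in url_lists for u in urls)
        total = sum(host_counts.values())

        # Frequent (aggregator) hosts get low weights; rare hosts such as
        # personal or group publication pages get high weights.
        return {h: math.log(total / f) for h, f in host_counts.items()}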

Weighting: Inverse Host Frequency (IHF)
We notice that using hostnames alone may be problematic
Especially when one host has multiple hostnames or is reached via an IP address, so the same host is counted under several names with dissimilar frequency distributions
e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host
Therefore, we also experimented with two normalizations (sketched in code below):
Domain (e.g. uni-trier.de)
Resolving hostnames to IP addresses
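A sketch of the two normalization variants (the last-two-labels domain heuristic is a simplification; a public-suffix list would be more accurate):

    import socket
    from urllib.parse import urlparse

    def to_domain(url):
        # Keep only the registered domain,
        # e.g. www.informatik.uni-trier.de -> uni-trier.de (simplified heuristic).
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else host

    def to_ip(url):
        # Resolve the hostname to an IP address so aliases of the same host collapse.
        host = urlparse(url).hostname or ""
        try:
            return socket.gethostbyname(host)
        except socket.gaierror:
            return host  # fall back to the hostname if resolution fails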

Evaluation
Dataset
Manually-disambiguated dataset of 24 ambiguous names in the computer science domain
Each ambiguous name represented 2 unique authors (k = 2), except for one that represented 3
Each name is associated with 30 citations on average
The proportion of the largest class ranges from 50% to 97%
Search engine
Google (http://www.google.com/)

Evaluation
Single link performs best
Good for clustering citations from different publication pages together (some pages list only selected publications)
Some authors have disparate research areas, not well represented by a centroid vector
Resolving hostnames to IP addresses gives the best accuracy

Comparison to [Lee et al. 05, IQIS]

Discussion
Apparent correlation between accuracy and average number of URLs returned per citation
Author names with few URLs tend to fare poorly since results are mainly aggregator web sites
We do not observe any apparent relation between accuracy and number of citations for an author name
Our algorithm scales to large numbers of citations
Analysis of the returned URLs is very fast; execution time is dominated by search engine querying
Querying may already be done while spidering, so our algorithm is time-efficient

Conclusion
Summary
We focused on using URLs returned from searching citation titles
Respectable average accuracy of 0.836 using IP addresses with single link HAC clustering
Future work
Explore other sources of information, such as the publication venues of the citations as well as utilizing the actual contents of the web pages
Combine knowledge gained externally and internally to obtain improved performance