Large Scale End-to-End Entity Resolution: Algorithms & Explanations
In this Big Data era, many organizations have become data-rich. However, as data may be collected and integrated from multiple sources (both internal and external), these organizations also face the challenge of dealing with “dirty” data. Even for the same real-world entity, multiple occurrences from different sources may have discrepancies. For example, David Smith may appear as “Dave Smith” as well as “David Smith” in two different datasets. Likewise, two seemingly different product descriptions may refer to the same make of flash card. Similarly, two records may describe the same product, but the brand in one record may be incorrectly recorded as part of the product name (e.g., a product name of Adobe Acrobat 8 vs. a product name of Acrobat 8 with a brand of Adobe). As such, before any serious data analytics can be performed, the data must be cleaned to remove redundancy and inconsistency, handle missing data, and correct any data errors. This proposal focuses on one such key task to ensure the quality of the data: the Entity Resolution problem.
Entity resolution (ER), also known as duplicate record detection, is a fundamental problem in data integration and data cleaning. Given two (possibly identical) entity databases D1 and D2, the goal of ER is to determine, for each entity pair r ∈ D1, s ∈ D2, whether the two records represent the same real-world object. When D1 and D2 are the same, the task is to identify duplicates within a single database. The problem has a very long research history and various types of methods have been proposed. While much progress has been made, existing approaches are still far from offering satisfactory results. Our preliminary study on widely used benchmark datasets[1] shows that the F1-scores can be as low as 60% for some datasets, suggesting that there is much room for improvement.
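To make the pairwise formulation above concrete, the following is a minimal sketch of a naive ER baseline that compares every record in D1 against every record in D2 using a simple string-similarity threshold. The record layout, similarity function and threshold are illustrative assumptions only; this is not the matching approach developed in this project.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an illustrative choice only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def naive_entity_resolution(d1, d2, threshold=0.85):
    """Naive pairwise ER: label (r, s) a match if their names are
    sufficiently similar. Requires O(|D1| * |D2|) comparisons."""
    matches = []
    for r in d1:
        for s in d2:
            if similarity(r["name"], s["name"]) >= threshold:
                matches.append((r["id"], s["id"]))
    return matches

# Hypothetical records echoing the "Dave Smith" / "David Smith" example above.
d1 = [{"id": 1, "name": "Dave Smith"}]
d2 = [{"id": "a", "name": "David Smith"}, {"id": "b", "name": "Adobe Acrobat 8"}]
print(naive_entity_resolution(d1, d2))  # -> [(1, 'a')] with this toy threshold
```

The quadratic number of comparisons in this sketch is precisely what the blocking component discussed below is meant to avoid.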
More importantly, as in other deep learning (DL)-based schemes, it is unclear how to interpret the results: why are two unrelated entities wrongly labelled as the same entity, and why are two occurrences of the same entity considered different entities? While there has been some recent work on explaining machine/deep learning models in general, there has not been much effort to interpret ER models.
To address these research challenges, this project will focus on developing an end-to-end entity resolution framework that not only offers high accuracy, but also facilitates interpretation of the models. More specifically, we seek to:
·
Develop new ER schemes. A natural question to ask is whether deep learning can help with ER problems. If so, how effective is it for different ER tasks (e.g., structured/textual/dirty)? Do we need more complex models, or would a simple model suffice? How about the size of the training data: would having more data help a simple model or a complex model more? We seek to answer these questions as we develop new ER schemes.
·
Explain ER Predictions. Traditionally, ER schemes are developed and evaluated based on the notion of “accuracy” (e.g., using recall, precision and the F1 metric). Interestingly, we have discovered that even for the same model on the same dataset, the accuracy can be very different, depending on how the training and test data are selected! It is therefore crucial for practitioners to understand why certain predictions are made in order to assess the trustworthiness of a model. Unfortunately, while we have seen an increase in research on “Explainable AI”, there has been very little effort in the context of ER. We will investigate mechanisms to help users understand ER predictions (an illustrative sketch of one such mechanism is given after this list).
·
Develop a Large Scale End-to-End ER Management System. We will integrate the techniques developed above into a full-fledged end-to-end ER management system. Such a system goes beyond a naïve integration, and will include additional components such as blocking (a pre-processing phase that precedes the actual matching phase; a small sketch follows this list) and a rule-based engine to facilitate the labeling of training data. For the framework to be of practical use, we need to ensure that (a) it is efficient even for large datasets, and (b) it is easy to use. We also plan to further enhance our methods to progressively/incrementally adapt to feedback. In other words, the schemes should adapt to prediction errors (based on user feedback) to improve in accuracy.
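As a concrete (and deliberately simple) example of what “explaining an ER prediction” can mean, the sketch below uses attribute ablation: blank out one attribute at a time and report how much the matcher’s score drops. The score_pair interface, the toy scorer and the attribute names are hypothetical; this is a generic illustration, not the explanation mechanism this project will develop.

```python
def explain_match(score_pair, r, s, attributes):
    """Attribute-ablation explanation for a single ER prediction.

    score_pair(r, s) -> float is any pairwise matcher returning a match score.
    An attribute's importance is how much the score drops when that attribute
    is blanked out in one of the records.
    """
    base = score_pair(r, s)
    importance = {}
    for attr in attributes:
        r_ablated = {**r, attr: ""}          # copy of r with this attribute blanked
        importance[attr] = base - score_pair(r_ablated, s)
    return base, importance

# Hypothetical usage with a toy scorer over 'brand' and 'name' only.
def toy_score(r, s):
    return 0.5 * (r["brand"] == s["brand"]) + 0.5 * (r["name"] == s["name"])

r = {"brand": "Adobe", "name": "Acrobat 8"}
s = {"brand": "Adobe", "name": "Acrobat 8"}
print(explain_match(toy_score, r, s, ["brand", "name"]))
# (1.0, {'brand': 0.5, 'name': 0.5})
```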
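Blocking, mentioned in the last item above, reduces the quadratic number of pairwise comparisons by only comparing records that share a cheap-to-compute blocking key. The sketch below is a minimal example under assumed inputs (the key is simply the lowercased first token of the name); it is not the blocking strategy of the proposed system.

```python
from collections import defaultdict
from itertools import product

def blocking_key(record):
    """Illustrative blocking key: lowercased first token of the name."""
    return record["name"].split()[0].lower()

def candidate_pairs(d1, d2):
    """Group records by blocking key and only pair records within a block,
    instead of comparing every record in d1 with every record in d2."""
    blocks1, blocks2 = defaultdict(list), defaultdict(list)
    for r in d1:
        blocks1[blocking_key(r)].append(r)
    for s in d2:
        blocks2[blocking_key(s)].append(s)
    for key in blocks1.keys() & blocks2.keys():
        yield from product(blocks1[key], blocks2[key])

# Hypothetical usage: only pairs sharing a blocking key reach the (expensive) matcher.
d1 = [{"id": 1, "name": "Adobe Acrobat 8"}, {"id": 2, "name": "David Smith"}]
d2 = [{"id": "a", "name": "Adobe Acrobat 8 Standard"}, {"id": "b", "name": "Dave Smith"}]
for r, s in candidate_pairs(d1, d2):
    print(r["id"], s["id"])
# Prints: 1 a -- the "David Smith" / "Dave Smith" pair is missed by this crude key,
# illustrating the recall/efficiency trade-off that makes blocking design non-trivial.
```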
Team:
·
Kian-Lee Tan
·
Dongxiang Zhang (Zhejiang University)
Publications:
·
GNEM: A Generic One-to-Set Neural Entity Matching Framework. R. Chen, Y. Shen, D. Zhang. WWW'21, April 19-23, 2021, Ljubljana, Slovenia.
·
Multi-Context Attention for Entity Matching (Short Paper). D. Zhang, Y. Nie, S. Wu, Y. Shen, K.L. Tan. WWW'20, Taipei, Taiwan, April 20-24, 2020, pp. 2634-2640.
·
Unsupervised Entity Resolution with Blocking and Graph Algorithms. D. Zhang, D. Li, L. Guo, K.L. Tan. IEEE Transactions on Knowledge and Data Engineering. Accepted in April 2020.
[1] https://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_enti