My research
is in developing database systems and technologies for managing big data in
Data Science applications. Currently, I am working on several projects.
DBMS on Persistent (Non-Volatile) Memory
In the last decade, growing demand for fast, power efficient and cost effective memory solutions have prompted major players like Micron, Intel, Samsung to invest heavily in their R&D for advanced memory technologies. In particular, there has been much anticipation for a next generation of non-volatile memory (NVM) that are byte addressable, persistent, high storage capacity and has near-DRAM read/write latency. This has also prompted forward-looking (software) researchers to develop novel programming models, memory architectures, data structures, query processing algorithms, recovery techniques, file systems and databases that exploit these features and characteristics to optimize the performance of such NVM technologies. However, as NVM hardware was not available yet, the design decisions made in these works were largely based on certain anticipated behaviors of NVMs and evaluated on emulators/simulators. For example, NVM is expected to offer read latency comparable to DRAM but much higher write latency and to have limited write endurance. These led to algorithms that minimize writes to NVM. As another example, the assumption that NVM has DRAM-like cacheline access granularity and near-DRAM read/write bandwidth facilitated the design of solutions which are (supposedly) scalable.
In 2019, non-volatile memory DIMMs are finally commercially available with the release of the Intel® Optane™ DC Persistent Memory Module (or just “Optane DC PMM” or “Optane NVDIMM”). Since then, there has been a proliferation of research studies that evaluate this technology as well as re-evaluate prior art on the real Optane DC PMM. The results of these studies showed that some of the earlier assumptions on NVM are not applicable to Optane DC PMM. For example, in Optane DC PMM, data are accessed in blocks of 256 bytes. This is in contrast to the DRAM where the cacheline access granularity is 64 bytes. This translates to a higher read latency for Optane DC PMM than DRAM. Moreover, it turns out the write latency can be as good as DRAM write latency as Optane DC PMM guarantees that data in the Optane Controller Buffer will be persisted on NVM. As another example, the bandwidth of Optane DC PMM for both read and write is limited which leads to poorer scalability than expected. In addition, it turns out that the limited write endurance may not be a critical issue for software developers to fix since hardware-layer wear-levelling is an inlined technique in Optane DC PMM.
In this project, we study how persistent memory impact the design of DBMS. We will focus particularly on the App Direct Mode of the Optane DC PMM which supports in-memory persistence. While today’s software operates under the assumption that data can only be persistent if it is written to slow storage (SSDs, HDDs, the cloud, etc.) this mode allows data to persist at memory speeds. As a start, we will build a full-fledged in-NVM DBMS by integrating techniques that have been developed in isolation (e.g., B+-tree, sorting and joining algorithms, storage engine).
See here for more information.
Large Scale End-to-End Entity Resolution: Algorithms & Explanations
Entity matching/resolution, also known as duplicate record detection, is a fundamental
problem in data integration and data cleaning. Given two (possibly identical) entity databases
D1 and D2, the goal of ER is to determine for each entity pair r in D1, s in D2, whether they represent the same real-world object. When D1 and D2 are the same, the task is to identify duplicates within the same database. The problem has a very long research history and various types of methods have been proposed. While much progress has been made, existing approaches are still far from offering satisfactory results. Our preliminary study on widely used benchmark datasets shows that the F1-scores can be as low as 60% for some datasets, suggesting that there is much room for improvements.
More importantly, as in other DL-based schemes, it is unclear how to interpret the results - why are two unrelated entities wrongly labelled as a same entity, and why are two occurrences of an entity considered different entities? While there has been some recent works on explaining machine/deep learning models in general, there has not been much effort to interpret ER models.
See here for more information.
MAGE: Dynamic Graph Analytics On GPUs
In many real-world
graph applications, data are frequently updated. For example, online social
networks continuously evolve as new social interactions occur. To enable online
analytics for data streaming applications, we are motivated to propose a
dynamic graph analytic system on GPUs, which simultaneously processes a stream
of graph update operations and analytical queries on the dynamically maintained
matrix representation of the graph. The high update rate brought about numerous
questions and challenges:
·
How
can we efficiently perform updates against the matrix stored on GPUs? Due to
the Single Instruction Multiple Thread (SIMT) architecture of GPU devices,
there are two keys in optimizing GPU-resident applications: 1) minimizing
thread divergence; 2) maximizing coalesced memory access. It is non-trivial to
design an update-efficient storage scheme on GPU, which also supports blazing
fast computation. On one hand, rebuilding the entire storage structure against
each update is both time and space inefficient. On the other hand, CPU-styled
updates (e.g. insertion/deletion) of SM entries tend to scatter data in the GPU
memory which leads to uncoalesced memory access and increases of the level of
thread divergence for the computation.
·
How
to expand existing SM primitives to support a wider range of applications?
Existing matrix primitives, e.g. SM vector multiplication, operate on the
entire matrix. This is too rigid for many applications. For instance, in the
case of community tracking on dynamic networks, it is rather inefficient to
query the adjacency matrix as a whole since the community is only a subgraph of
the entire social network. In another example, it is often of interest to
monitor and analyze certain consumers' purchase history (e.g. Singaporean) from
a large matrix representing all consumers' purchase history (e.g. each row of
the matrix represents a consumer and each row entry denotes which item was
bought by the consumer). Again, running the analysis on the entire matrix is
undesirable as only a subset of the rows is required.
·
How
to optimize dynamic graph analytics on GPUs? Note that it does not suffice to
optimize each matrix primitive on GPUs, as many analytic queries are submitted
simultaneously on the dynamically updated matrix in various application scenarios.
Analysts may track different communities in a dynamic social network. Likewise,
marketers are keen to analyse real-time sales records
of different consumer groups to deliver targeted advertising instantly. To
fully exploit the computational capabilities of GPUs, it calls for novel
strategies of scheduling online graph analytics processing on GPUs. In
addition, to facilitate applications in larger scales, there is a need to
support optimizations on multiple GPUs and CPU-GPU heterogeneous computing
environments.
·
How
to optimize graph processing algorithms for dynamic graphs? Existing analytics
algorithms like pagerank and subgraph matching have
to be redesigned to fully exploit a CPU-GPU heterogeneous computing
environment.
To address
the aforementioned research challenges, this project will focus on developing
novel storage structures and algorithmic optimization for supporting dynamic SM
analytics on GPU-assisted platforms.
Team:
·
Kian-Lee
Tan, NUS
·
Wentian Guo, NUS
·
Mo
Sha, NUS
·
Yuchen
Li, SMU
Publications:
· Enhancing Balanced Graph Edge Partition with Effective Local Search . Z. Guo, M. Xiao, Y. Zhou, D. Zhang, K.L. Tan, 35th AAAI Conference on Artifical Intelligence (AAAI'2021), Vancouver Convention Centre, Vancouver, Canada, 2-9 Feb 2021.
· Exploiting Reuse for GPU Subgraph Enumeration . W. Guo, Y. Li, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, accepted Oct 2020.
· GPU-Accelerated Subgraph Enumeration on Partitioned Graphs . W. Guo, Y. Li, M. Sha, B. He, X. Xiao, K.L. Tan, SIGMOD 2020, pp. 1067-1082.
·
TopoX: Topology Refactorization for Efficient Graph
Partitioning and Processing. D. Li, Y. Zhang, J. Wang, K.L.
Tan, PVLDB, Vol. 12, No. 8, 2019, pp. 891-905.
(An extended version appeared in
IEEE/ACM Transactions on Networking, Vol. 28, No. 6, pp.
2768-2782, December 2020.)
·
GPU-based Graph Traversal on
Compressed Graphs.
M. Sha, Y. Li, K.L. Tan, SIGMOD 2019, pp. 775-792.
·
Parallel Personalized Pagerank on Dynamic Graphs W. Guo, Y. Li, M. Sha, K.L. Tan, PVLDB , Vol.
11, No. 1, 2017, pp. 93-106.
·
Accelerating Dynamic Graph Analytics
on GPUs, M. Sha, Y.
Li, B. He, K.L. Tan, PVLDB, Vol. 11, No. 1, 2017, pp.
107-120.
SONAR: Social Network Analytics
Today’s
social networking platforms (SNPs) offers a rich amount of information. This
include not only users profiles but also the social
relationships between users and the activities on SNPs. With such information,
it is possible to extract user preferences and understand what/how information
are shared between friends or followers. We have been investigating innovative
solutions for social media marketing (SMM) in SNPs, including influence
maximization, and different channels for SMM such as directed target
advertising (i.e., advertisements are directly posted to match users'
preferences), celebrity social advertising (i.e, advertisements are
endorsed by celebrities) and self influential
advertising (i.e., advertisements are propagated to wider audiences through the
advertiser's friends or followers). Our recent work have considered a variety
of other critical factors that lead to a variety of important
variants of the problem:
·
Stream Influence Maximization. Social influence is highly
dynamic and the rate at which information is propagated between users can be
altered drastically by breaking news and trending topics. This means the influencers
are also dynamically changing. How to maintain a changing set of influencers
continuously is a challenge.
·
Topic-aware Influence Maximization. Users in a social network may have
various interests (which can be represented by topics), and the influence
strength between two users (i.e., edge) is often topic-dependent. For example,
a user may be influenced by her friend in some topics (e.g., sports),
while remaining neutral/unaffected in others (e.g., politics). It is therefore
important to design solutions that are topic-aware, i.e., considers
user-to-user influence strength (edges) is topic-dependent.
·
Location-aware Influence
Maximization. While
two users may be socially close in the social network, they may not be
physically near. As such, it does not make sense to send an advertisement that
is in Singapore to a friend that is in US. Finding the right set of users who
can influence users that are spatially and socially close is the right way to
go.
Our research
seeks to develop approximate schemes that are not only efficient for real-time
applications, but also offer theoretical guarantees on performance.
Team:
·
Kian-Lee
Tan, NUS
·
Yanhao Wang, NUS
·
Yuchen
Li, SMU
Publications:
·
Coresets for Minimum Enclosing Balls over Sliding Windows
Y. Wang, Y. Li, K.L. Tan, KDD 2019: 314-323, 2019.
·
Semantic and Influence aware k-Representative
Queries over Social Streams Y. Wang, Y. Li, K.L. Tan, EDBT 2019: 181-192, 2019.
·
Location-aware Influence
Maximization over Dynamic Social Streams Y. Wang, Y. Li, J. Fan, K.L. Tan, ACM
Transactions on Information Systems 36(4): 43:1-43:35, 2018.
·
Efficient Representative Subset
Selection over Sliding Windows Y. Wang, Y. Li, K.L. Tan, IEEE Transactions on Knowledge and Data
Engineering, July 2019, IEEE CS, 31(7): 1327-1340.
·
Influence Maximization on Social
Graphs: An Survey
Y. Li, J. Fan, Y. Wang, K.L. Tan, IEEE Transactions on Knowledge and Data
Engineering, October 2018, IEEE CS, 30(10): 1852-1872.
·
River: A Real-time Influence
Monitoring System on Social Media Streams (Demo Paper) M. Sha, Y. Li, Y. Wang, W. Guo,
K.L. Tan, Proceedings of the 2018 IEEE International Conference on Data Mining
(ICDM'18), Singapore, Nov 17-20, 2018.
·
A Sliding-Window Framework for
Representative Subset Selection (Poster Paper) Y. Wang, Y. Li, K.L. Tan, Proceedings of the
34th International Conference on Data Engineering (ICDE'18), CNAM, Paris, April
16-19, pp. 1268-1271, 2018.
·
OCTOPUS: An Online Topic-Aware
Influence Analysis System for Social Networks (Demo) J. Fan, J. Qiu,
Y. Li, Q. Meng, D. Zhang, G. Li, K.L. Tan, X. Du, Proceedings of the 34th
International Conference on Data Engineering (ICDE'18), CNAM, Paris, April
16-19, pp. 1569-1572, 2018.
·
Parallel Discovering Your Selling
Points: Personalized Social Influential Tag Exploration Y. Li, J. Fan, D. Zhang, K.L. Tan, 2017
International Conference on Management of Data (SIGMOD'2017), May 14 - May 19,
2017, Chicago, IL, USA, pp. 619-634.
·
Real-Time Influence Maximization on
Dynamic Social Streams Y. Wang, Q. Fan, Y. Li, Kian-Lee Tan, PVLDB (VLDB'2017), Vol. 10, No. 7,
March 2017, pp. 805-816.
·
Context-Aware Advertisement
Recommendation for High-Speed Social News Feeding Y. Li, D. Zhang, Z. Lan, K.L. Tan, Proceedings
of the 32th International Conference on Data Engineering (ICDE'16), Helsinki,
Finland, May 16-20, 2016, pp. 505-516.
·
Realtime Targeted Influence Maximization
for Online Advertisements Y. Li, D. Zhang, K.L. Tan, PVLDB (VLDB'2015), Vol. 8, No. 10, 2015, pp.
1070-1081.
·
Online Topic-Aware Influence
Maximization S.
Chen, J. Fan, G. Li, J. Feng, K.L. Tan, J. Tang, PVLDB (VLDB'2015), Vol. 8, No.
6, 2015, pp. 666-677.