Beng Chin OOI: Research Philosophy and Impact
Philosophy:
My research focuses on fundamental data management principles and systems for data-driven applications. Guided by the philosophy that all algorithms and structures should be simple, elegant and efficient, my work lays the foundation for the design and implementation of systems (DBxX) that are not only efficient and robust, but also scalable and secure. I approach research problems holistically, for example, by identifying and exploiting application properties to build novel end-to-end database systems.
My research impact has been deeply felt across all the eras of data management over the last 40 years. In the pre-cloud, pre-big-data era, my work on storage, indexing and peer-to-peer systems shaped the development of multimedia and geospatial applications. It remains hugely influential today in the design of vector databases and AI-enabled applications. In the big-data era, my work on end-to-end systems supporting all data processing steps, from data cleaning, through data curation with humans in the loop (crowdsourcing), to big data processing, led to novel distributed systems that are highly efficient, robust, scalable and secure. These systems have been successfully deployed in real industries, including healthcare and finance. In the machine learning era, my research was among the first to recognize and advocate that AI and databases should work synergistically. I led the first Apache project on distributed deep learning systems.
I am now realizing the vision of novel systems that combine the best features of AI with those of traditional databases.
Fundamental Database Principles
Data Blade/Cartridge:
In the early years, database systems were mostly relational and not designed to support new data types such as spatial objects and relationships. Instead of constructing a database system from scratch, I chose to develop a subsystem on top of an existing system. For my PhD thesis, I built GEOQL as a GIS cartridge or data blade on top of an existing system, addressing issues such as SQL extensions that support spatial relationships, indexing (the skd-tree), query optimization (to optimize across spatial and non-spatial operations) and system design. It was published as a Springer monograph, Lecture Notes in Computer Science #471.
Indexing:
B+-tree based Indexing: Many indexes have been proposed in the literature. However, very few have been implemented in database backends, because a new index affects concurrency control, join methods and query processing strategies, all of which must be redesigned. Therefore, I "made" the existing B+-tree work efficiently for new data types, so that the resulting indexes can be adopted directly by existing database backends.
High-dimensional indexing suffers from the curse of dimensionality: sometimes it is more efficient to scan the whole database than to use an index. Yet it is required for many applications, such as pattern recognition and ML applications.
In 2001, I proposed iDistance (iDistance wiki), a simple, elegant and yet efficient distance-based high-dimensional index built on the B+-tree. It has a public codebase and is used in many extensions, including learned indexes.
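A minimal sketch of the iDistance idea in Python, using a sorted list of keys as a stand-in for the B+-tree and Euclidean distance; the reference points, the constant C and the class name are illustrative assumptions, not the published implementation:

import bisect, math

def dist(a, b):
    return math.dist(a, b)   # Euclidean distance

class IDistanceSketch:
    # Illustrative only: each point is mapped to the 1-D key
    #   key = partition_id * C + distance(point, its reference point)
    # and the (key, point) pairs are kept sorted, standing in for a B+-tree.
    def __init__(self, points, refs, C=1000.0):    # C must exceed any partition radius
        self.refs, self.C = refs, C
        self.radius = [0.0] * len(refs)            # max distance seen per partition
        self.entries = []
        for p in points:
            i = min(range(len(refs)), key=lambda j: dist(p, refs[j]))
            d = dist(p, refs[i])
            self.radius[i] = max(self.radius[i], d)
            bisect.insort(self.entries, (i * C + d, p))

    def range_query(self, q, r):
        keys = [k for k, _ in self.entries]        # a B+-tree would provide this ordering
        answers = []
        for i, ref in enumerate(self.refs):
            dq = dist(q, ref)
            if dq - r > self.radius[i]:            # filter: partition cannot hold answers
                continue
            lo = i * self.C + max(0.0, dq - r)     # only keys in [lo, hi] can qualify
            hi = i * self.C + min(self.radius[i], dq + r)
            for k, p in self.entries[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]:
                if dist(q, p) <= r:                # refine: exact distance check
                    answers.append(p)
        return answers

# Example usage with four reference points on the unit square (hypothetical data):
# idx = IDistanceSketch([(0.1, 0.2), (0.8, 0.9), (0.5, 0.4)],
#                       refs=[(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)], C=10.0)
# idx.range_query((0.5, 0.5), 0.2)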
In the early 2000s, emerging applications of data management technology involved the monitoring and querying of large numbers of continuous variables, e.g., the positions of mobile service users, termed moving objects.
In such applications, large quantities of state samples obtained via sensors are streamed to a database. Indexes for moving objects must not only support queries efficiently, but must also support frequent updates. Indexes based on minimum bounding regions (MBRs), such as the R-tree, exhibit high concurrency overheads during node splitting, and each individual update is costly.
In 2004, we proposed the simple, elegant and yet efficient Bx-tree (Bx-tree wiki), which enables the B+-tree to manage moving objects. We represent moving-object locations as vectors that are timestamped based on their update time.
By applying a novel linearization technique to these values, it becomes possible to index them using a single B+-tree that partitions values according to their timestamp and otherwise preserves spatial proximity. The concept of a rolling index, based on time, rolls the index forward to keep its structure efficient, and the concept of speed-aware query-rectangle enlargement during query processing supports fast retrieval. More importantly, the index can be grafted into existing database systems cost-effectively.
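A minimal sketch of the linearization step in Python, assuming a Z-order (Morton) curve over a discretized grid and fixed-length time phases; the grid size, phase length and function names are illustrative assumptions, and the rolling and query-enlargement machinery is only outlined in comments:

GRID_BITS = 16       # assumed: positions discretized onto a 2^16 x 2^16 grid
PHASE = 60.0         # assumed: the index rolls over fixed-length time phases (seconds)

def z_order(x, y, bits=GRID_BITS):
    # Interleave the bits of x and y into a Z-order (Morton) value, a space-filling
    # curve that largely preserves spatial proximity in one dimension.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def bx_key(x, y, vx, vy, t_update):
    # Advance the position to the label timestamp of the object's time partition,
    # linearize it with the Z-order curve, and prefix the partition number so that
    # a single B+-tree can hold all partitions.
    partition = int(t_update // PHASE)
    t_label = (partition + 1) * PHASE                 # label timestamp of this partition
    lx = int(x + vx * (t_label - t_update))           # extrapolated grid position
    ly = int(y + vy * (t_label - t_update))
    return (partition << (2 * GRID_BITS)) | z_order(lx, ly)

# Insertion then reduces to an ordinary B+-tree insert of (bx_key(...), object); a
# range query enlarges the query rectangle by the maximum speed times the elapsed
# time and maps it to Z-order key ranges within each live partition.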
This is in line with my R&D philosophy that all algorithms and structures should be simple, elegant and yet efficient so that they are implementable, maintainable and scalable in actual applications, and all systems must therefore be efficient, scalable, extensible and easy to use.
Filter and Refine:
It is costly to retrieve data optimally. Many algorithms and indexing structures are designed based on the filter-and-refine principle. At the filtering stage, less promising or irrelevant data are quickly eliminated. At the refinement stage, computationally expensive comparison and checking are performed to obtain the answers. For efficiency and scalability, my algorithms and systems have been designed based on this principle.
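A generic sketch of the filter-and-refine pattern in Python, using a bounding-box test as the cheap filter and exact point-in-polygon containment as the expensive refinement; the helper names and geometry are illustrative only:

def bbox(points):
    # Cheap summary of a polygon: its minimum bounding rectangle.
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def point_in_polygon(pt, poly):
    # Expensive exact test (ray casting); only run on filtered candidates.
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def polygons_containing(pt, polygons):
    # Filter: discard polygons whose bounding box cannot contain the point.
    candidates = [p for p in polygons if boxes_overlap(bbox(p), (pt[0], pt[1], pt[0], pt[1]))]
    # Refine: exact (and more costly) containment test on the survivors.
    return [p for p in candidates if point_in_polygon(pt, p)]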
System Architectures
P2P
Decentralized
Disaggregated
AI-Powered
Data-Centric AI
Apache SINGA (Apache SINGA wiki): In 2012, AlexNet showed the power of deep learning on ImageNet. Based on our observations of the resurgence of neural networks, the availability of large amounts of data, especially labelled data, and advances in hardware, we started the implementation of Apache SINGA in 2014, focusing on performance, scalability and usability, which won us the 2024 ACM SIGMOD Systems Award. The journey is described below.
ASF announcement in November 2019: the first Apache TLP from South East Asia.
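This is not Apache SINGA's API; the following is a generic NumPy sketch of the synchronous data-parallel training pattern that distributed deep learning systems scale up, with the gradient averaging standing in for an all-reduce across workers:

import numpy as np

def local_gradient(w, X, y):
    # Gradient of mean squared error for a linear model on one worker's data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_sgd(shards, dim, lr=0.1, steps=100):
    # Each "worker" computes a gradient on its own shard; the gradients are then
    # averaged (an all-reduce in a real distributed system) and applied to the model.
    w = np.zeros(dim)
    for _ in range(steps):
        grads = [local_gradient(w, X, y) for X, y in shards]   # parallel in practice
        w -= lr * np.mean(grads, axis=0)                       # synchronous aggregation
    return w

# Example with synthetic data split across 4 "workers" (hypothetical):
# rng = np.random.default_rng(0)
# X = rng.normal(size=(400, 3)); y = X @ np.array([1.0, -2.0, 0.5])
# shards = [(X[i::4], y[i::4]) for i in range(4)]
# w = data_parallel_sgd(shards, dim=3)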
Healthcare
Ng Teng Fong General Hospital uses FoodLG for pre-diabetes management
Deployment of DL models on SIEMENS machines for NUH
GEMINI + FoodLG
Cohort Analytics