GENIE and LAMP
The GENIE and LAMP project provides a systematic investigation into the use of semi-lazy learning for predictive analytics. GENIE (GENeric InvErted index) is a unified platform that supports the storage and retrieval of Big Data with various types of structure. LAMP (semi-LAzy Mining Paradigm) is a new data mining paradigm for predictive analytics that follows the lazy learning paradigm until the last step, where more expensive eager learning methods are applied to a set of k-nearest neighbours (KNNs). Specifically, GENIE efficiently supports generic k-Nearest Neighbours (k-NN) search, which LAMP utilizes to empower users with advanced predictive capabilities.
Figure 1: Overview of GENIE and LAMP
Figure 1 shows the overall framework of GENIE and LAMP. Although LAMP and GENIE can logically be seen as separate concepts, integrating them into a single unit brings a synergy that makes them more powerful as a predictive analytics engine. In the integrated scenario, GENIE supports LAMP by searching for KNNs over a large variety of complex and high-dimensional data, while LAMP, aside from being more efficient, can aggregate predictions and models to make more complex inferences.
GENIE: generic inverted index
New platforms have been extensively investigated to handle the first two Vs of big data, volume and velocity. Studies into the third V, variety, are, however, comparatively rare. Many big data domains generate large amounts of data with structure (e.g. sequences, trees and graphs) that needs to be processed while preserving its semantics. It is therefore desirable to develop a unified framework that can be flexibly customized to handle predictive analytics on data with different structures.
Specifically, GENIE will efficiently support generic k-Nearest Neighbours (k-NN) search.
Since LAMP requires a search for KNNs to make a prediction for a query object, finding KNNs becomes a fundamental operation for supporting LAMP on Big Data with a large variety of complex structures.
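To make the retrieval side concrete, the following is a minimal sketch of the match-count idea behind inverted-index k-NN search over sequences. All names here (InvertedIndexKNN, ngrams) are illustrative, not GENIE's actual API, and similarity is reduced to a simple shared-token count:

```python
from collections import defaultdict
import heapq

def ngrams(seq, n=3):
    """Decompose a sequence into overlapping n-grams (the index tokens)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

class InvertedIndexKNN:
    """Toy inverted index: similarity = number of shared tokens."""

    def __init__(self):
        self.postings = defaultdict(list)  # token -> ids of objects containing it
        self.objects = []

    def add(self, seq):
        oid = len(self.objects)
        self.objects.append(seq)
        for tok in set(ngrams(seq)):       # each token posted once per object
            self.postings[tok].append(oid)
        return oid

    def knn(self, query, k=3):
        """Count token matches for every candidate, then keep the top k."""
        counts = defaultdict(int)
        for tok in set(ngrams(query)):
            for oid in self.postings[tok]:
                counts[oid] += 1
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

index = InvertedIndexKNN()
for s in ["the quick brown fox", "the quick brown cat", "an unrelated string"]:
    index.add(s)
print(index.knn("the quick brown dog", k=2))  # the two "quick brown" strings win
```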
GENIE should have the following characteristics:
- Cater to many different data types, including sequences, trees, time series, trajectories, etc.
- Cater to many different similarity functions without rebuilding the index for each function (see the sketch after this list).
- Handle partial similarity, for example, two sequences that are similar only because they share a sufficiently large sub-sequence.
- Be easily parallelizable on modern commodity hardware and scale up to a large number of concurrent queries.
- Provide a wide variety of configurations (on disk, in-memory, in-GPU) depending on the application's needs.
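To illustrate the second bullet with the toy index above: switching the similarity function (here to Jaccard similarity, purely as an example) changes only the scoring step; the inverted index itself is not rebuilt. The function name is again illustrative:

```python
def knn_jaccard(index, query, k=3):
    """Re-score the same postings lists with Jaccard similarity."""
    q_tokens = set(ngrams(query))
    shared = defaultdict(int)              # oid -> size of token-set intersection
    for tok in q_tokens:
        for oid in index.postings[tok]:
            shared[oid] += 1
    def jaccard(oid):
        n_obj = len(set(ngrams(index.objects[oid])))
        return shared[oid] / (n_obj + len(q_tokens) - shared[oid])
    return heapq.nlargest(k, shared, key=jaccard)
```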
LAMP: semi-lazy mining paradigm
Data mining algorithms can be divided into two categories: eager learning (e.g. SVM, neural networks) and lazy learning (e.g. nearest-neighbour classification). Eager learning algorithms have high training cost and suffer information loss, since they compute and commit to a model before even seeing the prediction query. Lazy learning methods, although they can capture a very rich set of models and hypotheses from the data, suffer from their simplistic prediction step. Developing methods that have the strengths but not the weaknesses of both approaches is highly desirable.
LAMP essentially follows the lazy learning paradigm until the last step, where more expensive eager learning methods are applied to a set of KNNs. We argue that modern hardware allows these eager learning algorithms to run efficiently on small datasets for real-time prediction.
The approach of LAMP goes as follows (see the sketch after this list):
- Like lazy learning, LAMP does not commit to a model before prediction but keeps the whole dataset intact.
- Like lazy learning, given a predictive query (typically a data object with some attribute value(s) to be predicted), LAMP finds a set of k most similar neighbours (also referred to as k-nearest neighbours or KNNs).
- LAMP then applies eager learning methods such as SVM or neural networks to the set of k neighbours to make a prediction.
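A minimal sketch of these three steps, assuming scikit-learn both for the neighbour search (standing in here for GENIE) and for the eager model; the synthetic data and the function name lamp_predict are illustrative, not the project's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))              # dataset kept intact, no global model
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # attribute to be predicted

nn = NearestNeighbors(n_neighbors=50).fit(X)  # stands in for GENIE's KNN search

def lamp_predict(query, k=50):
    """Semi-lazy prediction: retrieve KNNs, then train an eager model on them."""
    _, idx = nn.kneighbors(query.reshape(1, -1), n_neighbors=k)
    local_X, local_y = X[idx[0]], y[idx[0]]
    if len(np.unique(local_y)) == 1:
        return local_y[0]                   # unanimous neighbourhood: no model needed
    # Eager learning on the small neighbourhood only; cheap at this scale.
    return SVC().fit(local_X, local_y).predict(query.reshape(1, -1))[0]

print(lamp_predict(rng.normal(size=8)))
```

Because the eager model is fit on only k objects per query, its training cost stays small and bounded, which is the hardware argument made above.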
Publications
- Jingbo Zhou, Qi Guo, H. V. Jagadish, Wenhao Luan, Anthony K. H. Tung, Yuxin Zheng; "Genie: Generic Inverted Index on the GPU"; Technical report, National University of Singapore, 2015.
- Yuxin Zheng, Qi Guo, Anthony K. H. Tung, Sai Wu; "LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index"; Proc. of 2016 ACM Int. Conf. on Management of Data (SIGMOD 2016) [to appear].
- Jingbo Zhou, Anthony K. H. Tung; "SMiLer: A Semi-Lazy Time Series Prediction System for Sensors"; Proc. of 2015 ACM Int. Conf. on Management of Data (SIGMOD 2015).
- Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang; "Efficient and Effective KNN Sequence Search with Approximate n-grams"; Proceedings of the VLDB Endowment 7(1):1-12 (PVLDB 2014).
- Jingbo Zhou, Anthony K. H. Tung, Wei Wu, Wee Siong Ng; "A Semi-Lazy Approach to Probabilistic Path Prediction in Dynamic Environments"; Proc. of 2013 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2013).
- Jingbo Zhou, Anthony K. H. Tung, Wei Wu, Wee Siong Ng; "R2-D2: A System to Support Probabilistic Path Prediction in Dynamic Environments via Semi-Lazy Learning"; Proc. of 2013 Int. Conf. on Very Large Databases (VLDB 2013). [Demo paper]
Faculty
Anthony K. H. Tung, National University of Singapore
Collaborators
Chen Gang, Zhejiang University
H. V. Jagadish, University of Michigan, Ann Arbor
Shou Lidan, Zhejiang University
Wu Sai, Zhejiang University
Research Fellows
Zheng Yuxin, National University of Singapore
PhD Students
Guo Qi, National University of Singapore
Liu Qi, National University of Singapore
Luo Pingyi, National University of Singapore
Interns
Hu Wenyan, National University of Singapore
Luan Wenhao, National University of Singapore
Tang Yixuan, National University of Singapore
Yang Yueji, Wuhan University
Zhou Shaowen, National University of Singapore
Zhu Lei, National University of Singapore
Alumni
Bao Zhifeng, Royal Melbourne Institute of Technology
Wang Xiaoli, Xiamen University
Zhou Jingbo, Baidu