Skip to main content

Data Formats

COOL uses a native column-oriented data format to facilitate cohort and analytical queries. The storage hierarchy is summarized in the figure.

figure

A COOL Instance stores a dataset as a set of data tables under a directory. Each data table corresponds to a subdirectory, and it is horizontally partitioned into cublets, which follow the storage layout shown in the figure. A cublet is further horizontally partitioned into chunks. Within each chunk, data are stored by column, and metadata and indexes are built to speed up queries. For each table, a YAML file is needed to specify its schema. An example directory structure is shown here:

dataset
โ”œโ”€โ”€ cube-0
โ”‚ โ”œโ”€โ”€ table-cube-0.yaml
โ”‚ โ””โ”€โ”€ version-0
โ”‚ โ”œโ”€โ”€ cublet-0.7z
โ”‚ โ”œโ”€โ”€ cublet-1.7z
โ”‚ โ””โ”€โ”€ cublet-2.7z
โ””โ”€โ”€ cube-1
โ”œโ”€โ”€ table-cube-1.yaml
โ””โ”€โ”€ version-0
โ”œโ”€โ”€ cublet-0.7z
โ””โ”€โ”€ cublet-1.7z

COOL supports multiple popular input data formats, from which the system can automatically convert them into native storage format.

Cublet

A Cublet is a file with one MetaChunks and one or more DataChunks to store a group of records. It uses a list of offsets to quickly locate each chunk.

Bytes written to a file:

|-datachunks-|-metachunk-|-header-|-header offset-|

The header includes:

|-#chunk-|-chunk offsets-|

MetaChunk

A MetaChunk describes the value ranges of each dimension in a MetaField. It keeps a list of offsets, one for each MetaField to quickly locate them.

Bytes written to a file:

|-metachunks-|-header-|-header offset-|

The header includes:

|-chunkType-|-#field-|-field offsets-|

HashMetaField

A HashMetaField describes a field of type AppKey, UserKey, Action and Segment. It stores a field metadata as follows:

  • finger: the sorted list of hash values compressed. It is used to locate the value and its global id.
  • global ids: the global id assigned to each of the value in the order of their position in finger.
  • #values: number of values
  • value offsets: the offset of each value in values in the order of their position in finger.
  • values: the list of unique values of the field, with the same sort order as in finger. The bytes spanning #values, value offsets and values are compressed with lz4.

Bytes written to file:

|-finger compressor codec-|-compressed finger-|-global ids compressor codec-|-compressed codec-|-values compressor codec-|-compressed values-| 

compressed values bytes (in uncompressed form):

|-#values-|-value offsets-|-values-|

RangeMetaField

A RangeMetaField describes a field of type ActionTime and Metric. These raw values of these fields are numbers. Currently, their min and max are stored here.

DataChunk

A data chunk store a group of record in column oriented manner. There are two types of format, HashField and RangeField for different field types with tailored indexing and compression.

HashField

A HashField describes the values in each record of a field that belongs to a type described by HashMetaField.

  • keys: global ids of terms appeared in this data chunk, in ascending order.
  • values: a value the field takes is represented with a local id, which is the index of its global id in keys.
  • preCal and match sets: when pre-calculation is specified, one or more bit sets (match sets) are stored in place of values.

Bytes written to file: (when pre-calculation is off)

|-key compressor codec-|-compressed keys-|-values compressor codec-|-values-|

Bytes written to file: (when pre-calculation is on)

|-key compressor codec-|-compressed keys-|-PreCal codec-|-#bitset-|-compresed bitsets-|

Currently the bitset is compressed with run-length encoding

RangeField

A HashField describes the values in each records of a field that belongs to a type described by HashMetaField.

Bytes written to a file:

|-range codec-|-min-|-max-|-values compressor codec-|-compressed values-|