Data Formats
COOL uses a native column-oriented data format to facilitate cohort and analytical queries. The storage hierarchy is summarized in the figure.
A COOL Instance stores a dataset as a set of data tables under a directory. Each data table corresponds to a subdirectory, and it is horizontally partitioned into cublets, which follow the storage layout shown in the figure. A cublet is further horizontally partitioned into chunks. Within each chunk, data are stored by column, and metadata and indexes are built to speed up queries. For each table, a YAML file is needed to specify its schema. An example directory structure is shown here:
dataset
├── cube-0
│   ├── table-cube-0.yaml
│   └── version-0
│       ├── cublet-0.7z
│       ├── cublet-1.7z
│       └── cublet-2.7z
└── cube-1
    ├── table-cube-1.yaml
    └── version-0
        ├── cublet-0.7z
        └── cublet-1.7z
COOL supports several popular input data formats and automatically converts them into its native storage format.
Cublet
A Cublet is a file with one MetaChunk and one or more DataChunks that together store a group of records. It uses a list of offsets to quickly locate each chunk.
Bytes written to a file:
|-datachunks-|-metachunk-|-header-|-header offset-|
The header includes:
|-#chunk-|-chunk offsets-|
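Because the header offset sits at the very end of the file, a reader works backwards: it reads the trailer, jumps to the header, and then follows the chunk offsets. The Python sketch below illustrates that idea only. The 4-byte big-endian integers, the treatment of every chunk as an opaque byte string, and the helper names `write_cublet`, `read_chunk_offsets` and `read_chunks` are assumptions made for illustration, not COOL's actual encoding or API.

```python
import struct

INT = struct.Struct(">i")  # assumed: 4-byte big-endian integers


def write_cublet(chunks):
    """Serialize chunks as |-chunks-|-header-|-header offset-|."""
    body = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(body))  # where this chunk starts
        body += chunk
    header_offset = len(body)
    body += INT.pack(len(offsets))   # header: #chunk
    for off in offsets:              # header: chunk offsets
        body += INT.pack(off)
    body += INT.pack(header_offset)  # trailer: header offset
    return bytes(body)


def read_chunk_offsets(buf):
    """Return (chunk offsets, header offset), reading from the end of the buffer."""
    (header_offset,) = INT.unpack_from(buf, len(buf) - INT.size)
    (n_chunks,) = INT.unpack_from(buf, header_offset)
    offsets = [
        INT.unpack_from(buf, header_offset + INT.size * (1 + i))[0]
        for i in range(n_chunks)
    ]
    return offsets, header_offset


def read_chunks(buf):
    """Slice out each chunk; a chunk ends where the next begins (or at the header)."""
    offsets, header_offset = read_chunk_offsets(buf)
    ends = offsets[1:] + [header_offset]
    return [buf[start:end] for start, end in zip(offsets, ends)]
```

For example, `read_chunks(write_cublet([b"data-0", b"data-1", b"meta"]))` returns the three original byte strings, and only the chunks that a query needs have to be sliced out.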
MetaChunk
A MetaChunk describes the value ranges of each dimension in a MetaField. It keeps a list of offsets, one for each MetaField, so that each MetaField can be located quickly.
Bytes written to a file:
|-metachunks-|-header-|-header offset-|
The header includes:
|-chunkType-|-#field-|-field offsets-|
HashMetaField
A HashMetaField describes a field of type AppKey, UserKey, Action, or Segment. It stores the field's metadata as follows:
finger: the sorted list of hash values, compressed. It is used to locate a value and its global id.
global ids: the global id assigned to each value, in the order of their position in finger.
#values: the number of values.
value offsets: the offset of each value in values, in the order of their position in finger.
values: the list of unique values of the field, with the same sort order as in finger. The bytes spanning #values, value offsets and values are compressed with lz4.
Bytes written to file:
|-finger compressor codec-|-compressed finger-|-global ids compressor codec-|-compressed global ids-|-values compressor codec-|-compressed values-|
Compressed values bytes (in uncompressed form):
|-#values-|-value offsets-|-values-|
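Since finger is sorted, a value can be found by hashing it and binary-searching for the hash; the resulting position indexes both the global ids and the value offsets. The sketch below shows this on uncompressed, in-memory structures. It is a simplified illustration under stated assumptions: CRC32 stands in for whatever hash COOL uses, global ids are simply assigned in finger order, hash collisions are ignored, and the function names are hypothetical.

```python
import bisect
import zlib


def build_meta_field(values):
    """Build uncompressed finger, global ids, value offsets and values for a field."""
    # Assumed hash: CRC32 is a stand-in, not necessarily COOL's hash function.
    hashed = sorted((zlib.crc32(v.encode()), v) for v in set(values))
    finger = [h for h, _ in hashed]
    global_ids = list(range(len(hashed)))  # assumed: ids follow finger order
    blob = bytearray()
    offsets = []
    for _, v in hashed:
        offsets.append(len(blob))  # start of this value inside the values bytes
        blob += v.encode()
    return finger, global_ids, offsets, bytes(blob)


def lookup(meta, value):
    """Return (position in finger, global id), or None if the value is absent."""
    finger, global_ids, _, _ = meta
    h = zlib.crc32(value.encode())
    pos = bisect.bisect_left(finger, h)
    if pos == len(finger) or finger[pos] != h:
        return None
    return pos, global_ids[pos]


def value_at(meta, pos):
    """Decode the value stored at position pos, using the value offsets."""
    _, _, offsets, blob = meta
    end = offsets[pos + 1] if pos + 1 < len(offsets) else len(blob)
    return blob[offsets[pos]:end].decode()
```

A lookup therefore costs one hash computation plus a binary search over finger, and the value strings themselves are only touched when a position has to be decoded back into text.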
RangeMetaField
A RangeMetaField describes a field of type ActionTime or Metric. The raw values of these fields are numbers. Currently, their min and max are stored here.
DataChunk
A DataChunk stores a group of records in a column-oriented manner. There are two field formats, HashField and RangeField, for different field types, each with tailored indexing and compression.
HashField
A HashField describes the values in each record of a field that belongs to a type described by HashMetaField.
keys: the global ids of the terms that appear in this data chunk, in ascending order.
values: each value the field takes is represented by a local id, which is the index of its global id in keys.
preCal and match sets: when pre-calculation is specified, one or more bit sets (match sets) are stored in place of values.
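The relationship between keys and values can be sketched in a few lines of Python. The function names are hypothetical and the sketch works on plain lists, leaving out the compression codecs described in the byte layout.

```python
def encode_hash_field(global_ids):
    """Encode a column of global ids as (keys, values) with chunk-local ids."""
    keys = sorted(set(global_ids))                  # global ids present in this chunk
    local = {gid: i for i, gid in enumerate(keys)}  # global id -> local id
    values = [local[gid] for gid in global_ids]
    return keys, values


def decode_hash_field(keys, values):
    """Recover the global id of every record from its local id."""
    return [keys[v] for v in values]
```

For a column with global ids `[42, 7, 42, 1003, 7]`, the encoder produces keys `[7, 42, 1003]` and values `[1, 0, 1, 2, 0]`. Local ids are bounded by the number of distinct values in the chunk, so they are typically much smaller than the global ids they replace.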
Bytes written to file: (when pre-calculation is off)
|-key compressor codec-|-compressed keys-|-values compressor codec-|-values-|
Bytes written to file: (when pre-calculation is on)
|-key compressor codec-|-compressed keys-|-PreCal codec-|-#bitset-|-compressed bitsets-|
Currently, the bitset is compressed with run-length encoding.
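Run-length encoding suits match sets because long stretches of records tend to share the same bit. The sketch below is a generic run-length encoder for a bit sequence, shown only to illustrate the technique; the exact run format COOL writes is not specified here, and the representation (first bit plus alternating run lengths) and function names are assumptions.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a list of 0/1 bits as (first bit, run lengths)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    first = bits[0] if bits else 0
    return first, runs


def rle_decode(first, runs):
    """Expand run lengths back into bits; runs alternate starting from first."""
    bits = []
    bit = first
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1  # the next run holds the opposite bit
    return bits
```

The bitset `[0, 0, 0, 1, 1, 0, 0, 0, 0, 1]` encodes to a first bit of `0` with run lengths `[3, 2, 4, 1]`.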
RangeField
A RangeField describes the values in each record of a field that belongs to a type described by RangeMetaField.
Bytes written to a file:
|-range codec-|-min-|-max-|-values compressor codec-|-compressed values-|
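One consequence of writing min and max ahead of the compressed values is that a reader can inspect a chunk's value range without decompressing anything. The sketch below illustrates this with a range-overlap check. It is not COOL's implementation: the field widths (1-byte codecs, 64-bit integers), the codec ids, the use of zlib as the compressor, and all function names are assumptions, and whether COOL uses min/max to skip chunks in this way is not stated in this section.

```python
import struct
import zlib

# Assumed header: codec byte, int64 min, int64 max, codec byte (big-endian, no padding).
HEADER = struct.Struct(">bqqb")
RANGE_CODEC = 1  # hypothetical codec id
ZLIB_CODEC = 2   # hypothetical codec id; zlib stands in for the real compressor


def write_range_field(values):
    """Serialize a non-empty list of integers in the RangeField layout."""
    payload = zlib.compress(struct.pack(f">{len(values)}q", *values))
    return HEADER.pack(RANGE_CODEC, min(values), max(values), ZLIB_CODEC) + payload


def read_range(buf):
    """Read (min, max) from the header without touching the compressed values."""
    _, lo, hi, _ = HEADER.unpack_from(buf, 0)
    return lo, hi


def may_match(buf, query_lo, query_hi):
    """True if the chunk's [min, max] overlaps the query range."""
    lo, hi = read_range(buf)
    return lo <= query_hi and query_lo <= hi


def read_values(buf):
    """Decompress and decode all values of the field."""
    raw = zlib.decompress(buf[HEADER.size:])
    return list(struct.unpack(f">{len(raw) // 8}q", raw))
```

With values `[15, 3, 42, 8]`, a query for the range 43 to 100 can be rejected from the header alone, while a query for 40 to 50 requires decoding the values.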