Tutorial-input-format

How to Load Data to COOL

This tutorial first walks through a complete example of using a local COOL package to load the sample sogamo CSV dataset and execute a query. It then briefly describes how data in other formats can be loaded in the same way.

Data sources

Let's take a look at all the source files:

  • schema file: Each field is described by a triplet of name, fieldType, and preCal (whether pre-calculation is used when building COOL's cube). The charset used to write data in bytes can also be changed in this YAML file. Please refer to the schema instruction for more details on how to select the fieldType.
    ---
    charset: "UTF-8"
    fields:
    - name: "sessionId"
      fieldType: "AppKey"
      preCal: false
    - name: "playerId"
      fieldType: "UserKey"
      preCal: false
    - name: "role"
      fieldType: "Segment"
      preCal: false
    - name: "money"
      fieldType: "Metric"
      preCal: false
    - name: "event"
      fieldType: "Action"
      preCal: false
    - name: "eventDay"
      fieldType: "ActionTime"
      preCal: false
  • dimension file: It contains comma-separated field-name,value pairs that list every unique value a field can take. For Metric and ActionTime fields, only the min|max range is needed.
    playerId,e9a3374d0d418cdf
    eventDay,2013-05-20|2013-06-26
    country,Australia
    country,United States
  • data file: Each record follows the field order specified in the schema file.
    fd1ec667-75a4-415d-a250-8fbb71be7cab,43e3e0d84da1056,stonegolem,1638,launch,2013-05-20,3,Australia,OC,Sydney,1,0
    fd1ec667-75a4-415d-a250-8fbb71be7cab,43e3e0d84da1056,stonegolem,1638,fight,2013-05-20,3,Australia,OC,Sydney,1,0
    fd1ec667-75a4-415d-a250-8fbb71be7cab,43e3e0d84da1056,stonegolem,1638,fight,2013-05-20,3,Australia,OC,Sydney,1,0
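The relationship between the three files can be sketched in Python. This is only an illustration, not COOL's own tooling: `build_dimension_rows` is a hypothetical helper that derives dimension-file entries from data records, assuming Metric values are integers and ActionTime values are ISO dates as in the sample rows. The field list mirrors the schema excerpt above, so the extra trailing columns of each record are ignored.

```python
import csv
import io

# (name, fieldType) pairs copied from the schema excerpt above.
FIELDS = [
    ("sessionId", "AppKey"),
    ("playerId", "UserKey"),
    ("role", "Segment"),
    ("money", "Metric"),
    ("event", "Action"),
    ("eventDay", "ActionTime"),
]

# Two of the sample records from the data file above.
SAMPLE = (
    "fd1ec667-75a4-415d-a250-8fbb71be7cab,43e3e0d84da1056,stonegolem,1638,"
    "launch,2013-05-20,3,Australia,OC,Sydney,1,0\n"
    "fd1ec667-75a4-415d-a250-8fbb71be7cab,43e3e0d84da1056,stonegolem,1638,"
    "fight,2013-05-20,3,Australia,OC,Sydney,1,0\n"
)


def build_dimension_rows(fields, records):
    """Return (field_name, value) pairs in the style of the dimension file.

    Hypothetical helper: Metric and ActionTime fields yield one "min|max"
    entry, every other field yields one entry per unique value.
    """
    # A dict per field keeps unique values in first-seen order.
    seen = {name: {} for name, _ in fields}
    for row in records:
        # zip() stops at the shorter sequence, so columns beyond the
        # listed fields are skipped.
        for (name, _), value in zip(fields, row):
            seen[name].setdefault(value, None)

    rows = []
    for name, field_type in fields:
        values = list(seen[name])
        if not values:
            continue
        if field_type == "Metric":
            # Assumption: metrics are integers, as in the sample data.
            numbers = [int(v) for v in values]
            rows.append((name, f"{min(numbers)}|{max(numbers)}"))
        elif field_type == "ActionTime":
            # ISO dates (YYYY-MM-DD) order correctly as plain strings.
            rows.append((name, f"{min(values)}|{max(values)}"))
        else:
            rows.extend((name, v) for v in values)
    return rows


rows = build_dimension_rows(FIELDS, csv.reader(io.StringIO(SAMPLE)))
for name, value in rows:
    print(f"{name},{value}")
```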

Loading

After building the COOL system, run the following from the root directory to ingest the sogamo dataset into COOL.

In a Python development environment, we can send the load request with the `requests` package:
import requests
requests.post("http://127.0.0.1:8080/v1/load", data='{"dataFileType": "CSV", "cubeName": "sogamo", "schemaPath": "sogamo/table.yaml", "dimPath": "sogamo/dim.csv", "dataPath": "sogamo/test.csv", "outputPath": "datasetSource"}').text
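Writing the JSON body by hand is easy to get wrong, so it may help to build it from a dictionary instead. The sketch below uses only the standard library; `build_load_payload` is a hypothetical helper name, and the keys are exactly those of the request above.

```python
import json


def build_load_payload(data_file_type, cube_name, schema_path,
                       dim_path, data_path, output_path):
    """Serialize the load-request fields into a JSON string.

    Hypothetical helper; the key names are taken from the request above.
    """
    return json.dumps({
        "dataFileType": data_file_type,
        "cubeName": cube_name,
        "schemaPath": schema_path,
        "dimPath": dim_path,
        "dataPath": data_path,
        "outputPath": output_path,
    })


payload = build_load_payload(
    "CSV", "sogamo", "sogamo/table.yaml",
    "sogamo/dim.csv", "sogamo/test.csv", "datasetSource",
)
print(payload)
```

The resulting string can be passed as the `data` argument of `requests.post("http://127.0.0.1:8080/v1/load", data=payload)`.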

Output

A directory named test appears containing the converted dataset. It contains one table named sogamo and the table has one cublet.

test/
└── sogamo
    ├── table.yaml
    └── v1
        └── 17dd6860ee8.dz
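To confirm that a load produced the expected layout, the output directory can be inspected from Python. This is a minimal sketch, assuming the layout shown in the listing: version directories such as v1 sit under the cube directory, and cublets are the files ending in .dz. `list_cublets` is a hypothetical helper, not part of COOL.

```python
from pathlib import Path


def list_cublets(output_dir, cube_name):
    """Map each version directory of a cube to its sorted cublet file names.

    Hypothetical helper; assumes cublets are the *.dz files shown above.
    """
    cube_dir = Path(output_dir) / cube_name
    return {
        version.name: sorted(p.name for p in version.glob("*.dz"))
        for version in sorted(cube_dir.iterdir())
        if version.is_dir()  # skips table.yaml and other plain files
    }
```

For the listing above, `list_cublets("test", "sogamo")` would report a single version, v1, holding one cublet.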

Work with other formats

For Parquet and Arrow IPC files, only the dataFileType and the raw data file (dataPath) in the request need to change.

  • For a Parquet file
    requests.post("http://127.0.0.1:8080/v1/load", data='{"dataFileType": "PARQUET", "cubeName": "sogamo", "schemaPath": "sogamo/table.yaml", "dimPath": "sogamo/dim.csv", "dataPath": "sogamo/test.parquet", "outputPath": "datasetSource"}').text
  • For an Arrow IPC file
    requests.post("http://127.0.0.1:8080/v1/load", data='{"dataFileType": "ARROW", "cubeName": "sogamo", "schemaPath": "sogamo/table.yaml", "dimPath": "sogamo/dim.csv", "dataPath": "sogamo/test.arrow", "outputPath": "datasetSource"}').text

For Avro files, the Avro schema file sogamo/avro/schema.avsc must also be supplied through configPath.

In a Python development environment, we can send the load request with the `requests` package:
requests.post("http://127.0.0.1:8080/v1/load", data='{"dataFileType": "AVRO", "cubeName": "sogamo", "schemaPath": "sogamo/table.yaml", "dimPath": "sogamo/dim.csv", "dataPath": "sogamo/avro/test.avro", "outputPath": "datasetSource", "configPath": "sogamo/avro/schema.avsc"}').text
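Since the requests for the different formats differ only in dataFileType, dataPath and, for Avro, configPath, those fields can be derived from the data file itself. The sketch below is an illustration under assumptions: `format_fields` is a hypothetical helper, and the mapping from file extension to dataFileType is inferred from the file names used in this tutorial rather than from COOL's documentation.

```python
from pathlib import PurePosixPath

# Extension-to-dataFileType mapping, inferred from the requests in this tutorial.
FILE_TYPES = {
    ".csv": "CSV",
    ".parquet": "PARQUET",
    ".arrow": "ARROW",
    ".avro": "AVRO",
}


def format_fields(data_path, config_path=None):
    """Return the format-specific fields of a load request.

    Hypothetical helper: picks dataFileType from the file extension and
    insists on a schema path for Avro input.
    """
    suffix = PurePosixPath(data_path).suffix.lower()
    if suffix not in FILE_TYPES:
        raise ValueError(f"unsupported data file extension: {suffix!r}")

    fields = {"dataFileType": FILE_TYPES[suffix], "dataPath": data_path}
    if fields["dataFileType"] == "AVRO":
        if config_path is None:
            raise ValueError("Avro input also needs the .avsc schema path")
        fields["configPath"] = config_path
    return fields
```

These fields can then be merged with the shared ones (cubeName, schemaPath, dimPath, outputPath) before serializing the request body.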