Prepare datasets
This tutorial shows how to prepare necessary materials as COOL inputs.
Let's take the TPC-H dataset preparation process as an example to demonstrate how we prepare these files.
Step1. Download TPC-H dataset.
Download from data source and extract from data from the tic zip file.
Step2. Prepare Table.yaml
we need to firstly define table.yaml
to describe each filed. Please refer to the schema instruction for more details about how to select the filedType.
charset: "UTF-8"
fields:
- name: "O_ORDERKEY"
fieldType: "Segment"
preCal: false
- name: "O_ORDERSTATUS"
fieldType: "Segment"
preCal: false
- name: "O_TOTALPRICE"
fieldType: "Metric"
preCal: false
- name: "O_ORDERDATE"
fieldType: "ActionTime"
preCal: false
- name: "O_ORDERPRIORITY"
fieldType: "Segment"
preCal: false
- name: "O_CLERK"
fieldType: "Segment"
preCal: false
- name: "O_SHIPPRIORITY"
fieldType: "Segment"
preCal: false
- name: "O_COMMENT"
fieldType: "Segment"
preCal: false
- name: "C_NAME"
fieldType: "Segment"
preCal: false
- name: "C_ADDRESS"
fieldType: "Segment"
preCal: false
- name: "C_PHONE"
fieldType: "Segment"
preCal: false
- name: "C_ACCTBAL"
fieldType: "Metric"
preCal: false
- name: "C_MKTSEGMENT"
fieldType: "Segment"
preCal: false
- name: "C_COMMENT"
fieldType: "Segment"
preCal: false
- name: "N_NAME"
fieldType: "Segment"
preCal: false
- name: "N_COMMENT"
fieldType: "Segment"
preCal: false
- name: "R_NAME"
fieldType: "Segment"
preCal: false
- name: "R_COMMENT"
fieldType: "Segment"
preCal: false
- name: "app"
fieldType: "AppKey"
preCal: false
- name: "user"
fieldType: "UserKey"
preCal: false
Step3. Extract TPC-H dataset and convert to dim.csv
, data.csv
In
/TPC-H_Tools_v3.0.0/dbgen
runcp makefile.suite makefile
Open the Makefile, find and modify the following lines (103~111)
################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
# SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are: ATT, DOS, HP, IBM, ICL, MVS,
# SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCHFor Mac users, update package name from
#include <malloc.h>
to
#include <sys/malloc.h>
for each file.
Compile with
make
In
/TPC-H_Tools_v3.0.0/dbgen
, generate table namely with./dbgen -f -s n
where n is size of data in GB, it can be 1, 10, 0.1 eg.
./dbgen -f -s 0.1