Page tree
Skip to end of metadata
Go to start of metadata

Do you want to test Jethro but don't have any data to use? With the csvgen utility you can very easily generate mock data and load it into Jethro.
The utility generates csv files that can be loaded into Jethro based on your specifications. This document describes how to use the csvgen utility to generate data and load into Jethro.

csvgen

To use csvgen:

  1. browse csvgen github page at https://github.com/JethroData/csvgen 
  2. click on Download Zip to download a zip file containing all the required files. 
  3. Extract the zip file, and open the README.md file for instructions on how to install and use csvgen.

Before generating the data, you need to create the input file that instructs csvgen how to generate the data. The input file contains text where each row corresponds to a column in the table and contains a description of how to generate data for that column. 

Each row begins with the data type and may contain a list of parameters, as follows:

Data TypeDescriptionParametersExamples
row_numberGenerates a unique row number for each row, starting at 1 or at the number provided as parameter
  • Start from - The number to start from, optional.
  • Prefix -the prefix to prepend to the number, optional.
  • row_number 1000
  • row_number 100000 customer_
boolean Generates a Boolean value, true or false.
  • Null % - The % of NULL values, 0-100, optional.
  • boolean 20
uuid Generates a unique id
  • Null % - The % of NULL values, 0-100, optional
  • uuid 10

ip_address

Generates an IP address.
  • Null %, the % of NULL values, 0-100, optional.
  • ip_address 20

regex

Generates a string based on a regular expression.
  • Expression- The regular expression, mandatory.
  • Null % - The % of NULL values, 0-100, optional.
  • regex "201[2-5](0[1-9]|1[0-2])" 20 
  • regex [a-zA-Z0-9.+]{4,20}.(com|net|org)

word

Generates a random (meaningless) word.
  • Min - The minimum number of syllables, mandatory.
  • Max - The maximum number of syllables, mandatory.
  • Unique- The number of unique words to generate, optional (0 means no limit).
  • Null % - The % of NULL values, 0-100, optional.
  • word 2 4 0 15 
  • word 1 3 1000

number

Generates a whole or decimal number.
  • Min - Minimum value, mandatory.
  • Max- Maximum value, mandatory.
  • Decimal- The number of decimal places (0 for whole), mandatory.
  • Null %- The % of NULL values, 0-100, optional.
  • number 1 100 0 10 
  • number -999 999 2 2

date

Generates a date with a specified format within a given period.
  • Start- Start date period, mandatory.
  • End - End date period, mandatory.
  • Format - The date format as described here, mandatory.
  • Null % - The % of NULL values, 0-100, optional.
date "01-01-2012 00:00:00" "12-31-2012 23:59:59" "%m-%d-%Y %H:%M:%S" 30 
date "05-27-2014" "06-31-2014" "%m-%d-%Y" 25

fixed

Generates a string provided as parameter.
  • Fixed value - The string to generate, mandatory
  • Null % - The % of NULL values, 0-100, optional.
  • Fixed hello 10

list

Generates a string from a list of values.
  • List - List of values separated by a space enclosed by "[ ]" mandatory.
  • Null % - The % of NULL values, 0-100, optional.
list "[dog cat mouse]" 20 
* You must enclose the list with quotes.
custom listGenerates a string from a custom list of strings. For a given type, there should be a file {type}.csv in the data directory. The file should contain a list of values, each one on a separate line. One of these values will be picked randomly.
The data directory already contains several type files, such as first_namelast_namestate, and country. You can add new custom types by adding such a file to the data directory.
  • Unique- The number of unique values to pick, optional (0 means pick from the entire list).
  • Null % - The % of NULL values, 0-100, optional.
first_name 
color 10 
brands 100 10



Generating Data and Loading it into Jethro

The following steps assume that you already installed Jethro, and that you have created an instance named 'demo'. If your instance is named differently, just replace 'demo' with your instance name.


Let's say that we want to generate data for a customer table that has the following definition:

CREATE TABLE customer 
( 
customer_id INTEGER, 
first_name STRING, 
last_name STRING, 
age INTEGER, 
birth_country STRING, 
gender STRING, 
phone_number STRING, 
last_transaction_date TIMESTAMP, 
balance FLOAT 
);

 
Save the above data definition in a file called customer.ddl, and run the following command from the Jethro server console as user Jethro to create the table:

JethroClient demo localhost:9111 -p jethro -i customer.ddl 


Next we create a Jethro loader description file that will be used to load the data. Save the following in a file called customer.desc:

table customer 
overwrite
row format delimited
fields terminated by '|'
(
customer_id,
first_name,
last_name,
age,
birth_country,
gender STRING,
phone_number,
last_transaction_date format='MM-dd-yyyy',
balance
)


Now, we need to create the input file that instructs csvgen how to generate the data. Here is an example of such an input file for the customer table:

row_number 1000000
word 1 4 1000
word 1 4 3000
number 18 100 0
country
list '[Male Female]'
regex '[1-9]\d{2}\d{3}\d{4}'
date '01-01-2000' '01-01-2016' '%m-%d-%Y' 25
number -999 999 2 10


It will generate the following for each column:

  • customer_id – A unique number starting at 1000000 for the first row and incrementing by one for subsequent rows.
  • first_name – A random word, 1 to 4 syllables, out of a list of 1000 words.
  • last_name – A random word, 1 to 4 syllables, out of a list of 3000 words.
  • age – A whole number between 18 and 100.
  • birth_country – A value from a list of countries. The list is taken from an existing file named country.csv in the data directory.
  • Gender – Either "Male" or "Female".
  • phone_number – A string of 10 digits not starting with 0 generated by a regex with the format xxx xxx xxxx.
  • last_transaction_date – A date between 01-01-2000 and 01-01-2016 with 25% of NULL values.
  • balance – A decimal numbers with 2 places after the decimal point between -999 and 999 with 10% of NULL values.

Save the input text in a file called customer.in.

Verify that the data is generated correctly by generating 100 rows with this command:

./csvgen.py -i 100 -d '|' customer.in


./csvgen.py -i 100 -d '|' customer.in

It will send the output to the console. Here is an example of the last few lines of the output:

1000093|Paf|Yuc|76|Venezuela|Male|403-740-4749||706.68
1000094|Col|Axxa|64|Argentina|Male|371-486-9796|03-12-2009|
1000095|Uz|Ec|99|Italy|Male|624-579-8640|11-03-2000|-115.25
1000096|Bakdi|Iv|71|Iceland|Male|496-820-5813|03-13-2013|172.57
1000097|Vot|Iccife|71|United States|Male|637-522-6319||556.36
1000098|Nauhel|Owle|82|Senegal|Male|998-729-3208|11-24-2007|543.28
1000099|Ibri|Afzu|18|Slovenia|Female|629-826-0867|05-25-2013|432.62


Next, we will generate the data to be loaded to Jethro.
Let's generate data for 10000 customers, using the delimiter '|' and store it in a file customer.csv:

$./csvgen.py -i 10000 -d '|' -o customer.csv customer.in


Now that we have the csv file ready, we can load it into Jethro. Run the following command on the Jethro server console with user Jethro:

JethroLoader demo customer.desc customer.csv


You should see the name of the loader log file in the console. When the load finishes, view the log file to make sure that the load finished successfully. You should see lines like the following indicating that all the data was loaded:

Input records: 10000
Records loaded: 10000
Records skipped: 0 ( 0 empty lines, 0 no matching rules,
0 header lines, 0 outside partitions 
boundary )
Records rejected (see reject rep0
t file 'loader_rejects_20160304_212349_46837.out':Table rows - before :0
Table rows - truncated: 0
Table rows - after: 10000


Another option is to generate the data and pass it to the loader using a pipe instead of saving it in a file. This can be achieved using the following:

./csvgen.py -i 10000 -d '|' customer.in | JethroLoader demo customer.desc stdin


Speeding Up Data Generation

In order to generate a large amount of data, you can use the generate.sh script which can run multiple csvgen.py in parallel. This speeds up the data generation process and can be useful to generate data for large fact tables.
It spawns as many processes as requested and creates temporary output file for each. When all processes are done, it combines the output to one file.
To use it, run the following:

./generate.sh {name} {rows} {number of processes} {delimiter}

Where:
{name} is a description file named {name}.in. It also uses the same name for the output file which is generated as {name}.csv. This parameter is mandatory.
{rows} is the number of rows to generate for each process. Default =1M
{number of processes} is the number of processes to run in parallel. Default=1
{delimiter} is the column delimiter to use. Default='|'

To generate data for 1000000 customers using 20 parallel processes, issue the command:

./generate.sh customer 50000 20 '|'


Note that you need to specify the number of rows for each process and not the total number of rows. For best results, align the number of processes with the number of CPU's that the server has.

  • No labels