Jethro organizes tables and views into schemas. The default schema is called def_schema. Currently, creating new schemas is disabled.
Working with Tables
Creating a Table
To create a table, run a CREATE TABLE statement and provide the table name and list of columns.
For further information on, and examples of, partitioned tables, see Working with Partitioned Tables.
The supported data types are:
INT | BIGINT | FLOAT | DOUBLE | STRING | TIMESTAMP
For more details on each data type, see Data Types under SQL Language Elements.
Jethro automatically creates indexes for all columns – there is no need to run CREATE INDEX commands.
Renaming a Table
To rename a table, run an ALTER TABLE … RENAME TO statement. For example:
Adding and Dropping Columns from a Table
To add a new column to a table:
- Run an ALTER TABLE ... ADD COLUMN statement.
- Specify a new column name and data type;
Note: Existing rows will have NULL value in this new column.
To drop an existing column from a table, run an ALTER TABLE ... DROP COLUMN statement. For example:
Dropping a Table
To drop a table, run a DROP TABLE statement. For example:
Alternatively, run a TRUNCATE TABLE statement to delete all rows from a table while keeping the table. For example:
Listing All Tables
To list all tables, run the SHOW TABLES statement:
Unlike Hive and Impala, in Jethro the above command lists only tables, not views. To list views, use SHOW VIEWS command.
In addition, to get a detailed report of all tables including number of columns, number of rows, number of partitions and size on disk, run the SHOW TABLES EXTENDED statement (the output was slightly trimmed to fit into the page width):
Listing Table Columns
To list all columns of a specific table and their data type, run the DESCRIBE statement:
In addition, to get a detailed report of a table's columns, including number of distinct values, number of NULLs and size on disk, run the SHOW TABLE COLUMNS table_name statement (the output was slightly trimmed to fit into the page width):
Working with Views
A view is a virtual table, based on the result of an SQL query. While a view contains rows and columns just like an ordinary base table, it does not form part of a physical schema; removing (dropping) a view has no effect on the view's underlying data. Views are very useful for presenting only the requested data; for example, hiding the complexity a query that joins several tables by coding the query's logic into a view, or displaying only the non-confidential rows/columns of table by saving as a view the results of a query that selects theses columns.
Creating a View
To create a view, run a CREATE VIEW statement and provide a view name, an optional list of column name for the view, and the view's query text:
Each column in the view must have a unique and valid name. To give proper name to columns that are based on functions and expressions, either use column aliases in the query (as in the first example above) or specify a column list in the CREATE VIEW statement (as in the second example).
Dropping a View
To drop a view, run a DROP VIEW statement. For example:
Listing All Views
To list all views, run the SHOW VIEWS statement:
In addition, to get a detailed report of all views, including number of columns, status and definition, run the SHOW VIEWS EXTENDED statement:
Listing a View's Columns and Definition
To list all columns of a specific view and their data type, run the DESCRIBE statement. The command also prints the CREATE VIEW statement that was used for creating the view:
When you create a view, Jethro verifies that the view's query is valid. However, after the view is created, its query may become invalid. For example, a view may refer to a table (or another view) that was dropped or modified after the view's creation.
Jethro automatically maintains the status of each view. When the view's query becomes invalid, the view itself changes its status to INVALID and cannot be used. If the view's query becomes valid and matches the view's definition, the view will automatically become valid again.
The current status of each view can be displayed by running the command:
Working with Partitioned Tables
Partitioning physically splits a very large table ("fact table") into smaller pieces, based on user-specified criteria.
Why Use Partitioning?
- Partitioning enables rolling window operations Rolling window operation refers to the archiving and removal (purging) of old data, when new data is added to the data warehouse. This practice is commonly used when deciding to keep only a few years of data, thereby preventing the data in the data warehouse from growing indefinitely.(maintenance). For example, if a table is partitioned by a timestamp column, thereby adding one partition per day, you can implement DELETE by dropping a daily partition.
Because Jethro format is append-only, dropping a partition is the only way to delete data.
- In many databases, partitioning is considered a critical performance boosting tool, because it allows scanning less data during a full-table scan. However, because in Jethro all queries use the indexes to read only the relevant rows' a query will access the same number of rows regardless of whether the table is partitioned. As a result, partitioning is not a major performance feature for Jethro.
- Partitioning also enables better scalability. If a query needs to access a specific subset of the table, based on the partitioning key (for example, a single day), partitioning helps to isolate the requested subset from the actual size of the table – the query will consider a similarly sized subset of the table regardless of the table's retention (one month / year / decade of data).
Which Partitioning Types are Supported?
Jethro supports only range partitioning. In addition, only a single partitioning key is supported. Additional options are not needed, as partitioning is used mostly for rolling window operations and not as a tool for minimizing I/O (we use the indexes for that!)
How do I Choose a Partition Key?
The partitioning key should be the main timestamp column that is used both for maintenance (keeping data for n days) and in queries. That column data type is typically TIMESTAMP, but occasionally it is a generated key (usually INT) from a date dimension.
In most cases, the large tables hold events (calls, messages, page views, clicks and so on) and the likely partitioning key is the event timestamp, which indicates when the event took place.
How Large Should Each Partition Be?
Generally, you should align the partition range to the retention policy.
- If you plan to keep data for 12 months, purging once a month, start with monthly partitions.
- If you plan to keep data for 60 days, purging once a day, start with daily partitions.
However, it is also recommended to aim for a typical partition size of a few billion rows. Many small partitions are inefficient and may overwhelm the HDFS NameNode with too many small files. A few extra-large partitions of many billions of rows each are harder to maintain – for example, harder to correct the data after loading one bad file. Therefore:
- If you have a few billion rows per month, partition by month.
- If you have a few billion rows per day, partition by day.
- If you have a few billion rows per hour, partition by hour.
How does Jethro Handle a Partition Key with NULL Values?
If the partition key has NULL values, Jethro will create a dedicated partition for those rows. This is part of the normal processing.
How Complex is the Partitioning Process?
In Jethro, partitioning is automatic. You just need to specify a partitioning policy in the CREATE TABLE statement; Jethro creates the actual partitions on the fly, during data load. In addition, partitions do not have names, so there is need to manage them. When you drop a partition, you just reference it by value. For details, see "Dropping a Partition" on page .
Creating a Partitioned Table
Creating a partitioned table requires providing two additional parameters – the column to be used as a range partitioning key (in the PARTITION BY RANGE clause), and the interval that defines the boundaries of each partition (in the EVERY clause).
The way you specify the interval depends on the data type of the partitioning key:
- Examples with timestamp partition key:
CREATE TABLE events (id BIGINT, event_ts TIMESTAMP, value INT) PARTITION BY RANGE (event_ts) EVERY (INTERVAL '1' day);
CREATE TABLE events (id BIGINT, event_ts TIMESTAMP, value INT) PARTITION BY RANGE (event_ts) EVERY (INTERVAL '3' month);
- Example with numeric partition key:
CREATE TABLE events (id BIGINT, event_day_id INT, value INT) PARTITION BY RANGE (event_day_id) EVERY (1);
- Examples with string partition key:
Note: This option is not recommended - you should always keep timestamps in a timestamp data type.
CREATE TABLE events (id BIGINT, event_date STRING, value INT) PARTITION BY RANGE (event_date) EVERY (VALUE);
Adding a Partition
Adding new partitions is performed automatically, on-the- fly, by JethroLoader as it loads new rows. As a result, there is no direct command to add a new partition, and no option (or need) to create empty partitions in advance.
Concurrency: Adding new partitions by JethroLoader is transparent to concurrently running queries. Loading new rows does not block queries, even if it creates new partitions. Once the JethroLoader has completed successfully, new queries will recognize the new partitions and their data.
To list all partitions of a table, run:
This command lists the existing partitions, their range, row count, and disk space.
If the partition column is INT or BIGINT, the partition End Value column is inclusive – the end value is included in the partition range.
If the partitioning column is TIMESTAMP, FLOAT or DOUBLE, the partition End Value column is exclusive – the end value is not included in the partition range. For example, with yearly partitions:
Dropping a Partition
Dropping a partition is carried out by using the ALTER TABLE command. To identify which partition to drop, specify any value within the partition range. Jethro will find in which partition the value resides and will drop that partition. For example:
Concurrency: Dropping a partition can take up to few minutes, as it updates several table-level data structures. However, just like when partitions are added, this is transparent to concurrently running queries – dropping a partition does not block queries. Once the DROP PARTITION completes successfully, new queries will stop seeing the dropped partition and its data.