Usually the source data arrives as compressed text files, and the first step in an ETL process is to convert them to a columnar format so that users can query the data more efficiently.
Let’s consider a simple example where we have a single 120 MB source file in .gz format:
$ aws s3 ls s3://cloudsqale/hive/events.db/events_raw/
2018-12-16 18:49:45  120574494 data.gz
and want to convert it into a Hive table stored as ORC with a 256 MB stripe size. Will the 120 MB of .gz data be loaded into a single 256 MB stripe? Not so easy.
The target events table was created as follows:
CREATE TABLE events
(
  record_id        string,
  event_timestamp  string,
  event_name       string,
  app_name         string,
  payload          map<string, string>   -- Mapped JSON data
)
PARTITIONED BY (event_dt string)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/events.db/events'
TBLPROPERTIES ("orc.stripe.size"="268435456");   -- 256 MB stripe size
Now let’s load data into this table:
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-16')
SELECT record_id, event_timestamp, event_name, app_name, payload
FROM events_raw;
Upon completion, let’s first check the size of the target ORC file created in the events table:
$ aws s3 ls s3://cloudsqale/hive/events.db/events/event_dt=2018-12-16/
2018-12-16 20:13:23  160135903 000000_0
The original .gz file is 120 MB, while the ORC file is now 160 MB (about 30% larger). You can run the ORC dump utility to see the ORC file structure:
$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-16/000000_0
File Version: 0.12 with HIVE_13083
Rows: 25351
Compression: ZLIB
...
Stripe: offset: 3 data: 64029754 rows: 10000 tail: 532 index: 3351
Stripe: offset: 64033640 data: 93678058 rows: 15000 tail: 550 index: 25461
Stripe: offset: 157737709 data: 2392274 rows: 351 tail: 483 index: 2887
...
File length: 160135903 bytes
So instead of a single stripe, the ORC file contains 3 stripes: 64 MB (10,000 rows), 93 MB (15,000 rows), and 2.3 MB (351 rows).
Let’s investigate the ORC writer internals to see why this happened. Looking at the map task log, we can confirm that the ORC writer was created with a 256 MB stripe size:
INFO [main] org.apache.orc.impl.WriterImpl: ORC writer created for path: s3://cloudsqale/hive/events.db/events/.hive-staging_hive_2018-12-16_20-34-14_029_6156246830953134355-1/_task_tmp.-ext-10002/event_dt=2018-12-16/_tmp.000000_0 with stripeSize: 268435456 blockSize: 268435456 compression: ZLIB bufferSize: 262144
As rows are added to the current stripe, the ORC Memory Manager checks after every 5,000 rows (defined by the orc.rows.between.memory.checks option, which accepts values from 1 to 10,000 rows and is not applicable to Hive 2.1 and earlier) whether the ORC writer needs to flush the current stripe and start a new one.
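On Hive and ORC versions where this option is honored (per the note above, not Hive 2.1 or earlier), you could tighten the check interval so the writer overshoots the stripe size less. This is only a sketch; whether the property can be set at the session level like this depends on your Hive/ORC version:

set orc.rows.between.memory.checks = 1000;   -- check the writer size after every 1,000 rows instead of 5,000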
In the DEBUG log you can see the following records:
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 202588250 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 424133109 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: flushed
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 121520440 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 226542060 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 334103206 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: flushed
You can see that the first stripe was flushed after 10,000 rows and the second after 15,000 rows, which exactly matches the structure of the ORC file shown by the orcfiledump utility above.
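To capture this DEBUG output yourself when Hive runs the INSERT as a MapReduce job (as in this example), one option is to raise the log level of the map tasks for the session and then pull the task logs from the YARN/EMR console; treat the exact property as an assumption for your setup:

set mapreduce.map.log.level = DEBUG;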
In this particular example, limit is the stripe size (in general it can be scaled down depending on the memory pool size and the number of concurrent writers), while size is the current size of the buffered data.
Why did size grow so quickly (remember, the entire source .gz file is just 120 MB) that the stripe had to be flushed so often?
The reason is that size is the uncompressed size of the data, and in my example the payload column contains JSON, which can compress at a ratio of more than 10x. Indeed, the debug log above shows about 200 MB buffered after only 5,000 rows, roughly 40 KB per row, while the compressed input averages under 5 KB per row. Let’s now add some settings to create an ORC file with a single stripe.
I am going to change the stripe size to 1.2 GB at the table level. This is just metadata; a stripe that large will practically never be created when loading JSON data from .gz files:
ALTER TABLE events SET TBLPROPERTIES ("orc.stripe.size"="1268435456");
The stripe size now exceeds the default 1 GB heap, and the heap would limit the stripe buffer size, so let’s increase the heap as well as the percentage of it that can be used for the ORC buffers:
set mapreduce.map.memory.mb = 2048;
set mapreduce.map.java.opts = -Xmx1638m;

-- Allow stripe buffer to take more memory from heap
set hive.exec.orc.memory.pool = 0.8;
Now if you rerun the INSERT statement and look at the ORC file content, you can see that only 1 stripe was created:
Stripe: offset: 3 data: 161226974 rows: 25351 tail: 445 index: 27384
Hive 2.1, Amazon EMR and S3 Specific Issues
In our example, we converted a relatively small 120 MB .gz file into ORC. If you repeat the same steps for a 4 GB .gz file, you can still see stripes like the following:
Stripe: ... rows: 15000
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
File length: 7868020056 bytes
Padding length: 1691786646 bytes
Padding ratio: 21.5%
You can see that large (15,000 and 25,000 rows) and small (5,000 rows) stripes still alternate, and significant padding of 1.6 GB was added to the ORC file.
The problem is that when the ORC writer flushes the current stripe, it checks how much space is left in the block (defined by hive.exec.orc.default.block.size, 256 MB by default).
If the ratio of the available block space to the stripe size is higher than the block padding tolerance (defined by hive.exec.orc.block.padding.tolerance, 0.05 by default), then the size of one or several subsequent stripes is adjusted to fill the block. That’s why small stripes with the minimal 5,000 rows were created.
If the available block space to stripe size ratio is lower than the block padding tolerance, then ORC just writes dummy bytes to pad the stripe and fill the block.
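To make the two branches concrete, here is an illustrative calculation using the defaults above and a 256 MB stripe (the amounts of free block space are made up for the example): if 80 MB is still free in the block when a stripe is flushed, the ratio is 80 / 256 ≈ 0.31, which is above the 0.05 tolerance, so the next stripe is shrunk to roughly 80 MB to fill the block; if only 10 MB is free, the ratio is 10 / 256 ≈ 0.04, which is below the tolerance, so ORC pads those 10 MB with dummy bytes.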
Since we do not want small stripes, and there is no value in block padding for S3 object storage, we can use the following options to switch both off at the table level:
ALTER TABLE events SET TBLPROPERTIES ("orc.block.padding"="false", "orc.block.padding.tolerance"="1.0");