ORC File Format Internals – Creating Large Stripes in Hive Tables

Usually the source data arrives as compressed text files, and the first step in an ETL process is to convert them to a columnar format so that user queries run more efficiently.

Let’s consider a simple example in which we have a single 120 MB source file in .gz format:

$ aws s3 ls s3://cloudsqale/hive/events.db/events_raw/
2018-12-16 18:49:45  120574494 data.gz

and want to load it into a Hive table stored as ORC with a 256 MB stripe size. Will the 120 MB of .gz data end up in a single 256 MB stripe? Not that simple.
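
For reference, the source table events_raw can be a plain external table over the .gz files (Hive reads gzipped text files transparently). Here is a minimal sketch, assuming tab-delimited rows with the payload column stored as a delimited map; the actual definition may differ:

-- Hypothetical source table over the gzipped text files
CREATE EXTERNAL TABLE events_raw
(
  record_id       string,
  event_timestamp string,
  event_name      string,
  app_name        string,
  payload         map<string, string>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/events.db/events_raw/';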

The target events table was created as follows:

CREATE TABLE events
(
  record_id       string,
  event_timestamp string,
  event_name      string,
  app_name        string,
  payload         map<string, string>   -- Mapped JSON data
)
PARTITIONED BY (event_dt string)
STORED AS ORC
 LOCATION 's3://cloudsqale/hive/events.db/events'
 TBLPROPERTIES ("orc.stripe.size"="268435456");      -- 256 MB stripe size

Now let’s load data into this table:

INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-16')
SELECT
  record_id,
  event_timestamp,
  event_name,
  app_name,
  payload
FROM events_raw;

Upon completion, let’s first check the size of the target ORC file created in the events table (a single .gz file is not splittable, so one map task processed it and wrote a single output file):

$ aws s3 ls s3://cloudsqale/hive/events.db/events/event_dt=2018-12-16/
2018-12-16 20:13:23  160135903 000000_0

The original .gz file is 120 MB, while the ORC file is now 160 MB (about 30% larger). You can run the ORC dump utility to see the ORC file structure:

$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-16/000000_0

File Version: 0.12 with HIVE_13083
Rows: 25351
Compression: ZLIB
...
Stripe: offset: 3 data: 64029754 rows: 10000 tail: 532 index: 3351
Stripe: offset: 64033640 data: 93678058 rows: 15000 tail: 550 index: 25461
Stripe: offset: 157737709 data: 2392274 rows: 351 tail: 483 index: 2887
...
File length: 160135903 bytes

So instead of a single stripe, the ORC file contains 3 stripes: 64 MB (10,000 rows), 93 MB (15,000 rows) and 2.3 MB (351 rows).

Let’s investigate the ORC writer internals to see why this happened. Looking at the map task log, we can confirm that the ORC writer was created with a 256 MB stripe size:

INFO [main] org.apache.orc.impl.WriterImpl: ORC writer created for path:
s3://cloudsqale/hive/events.db/events/.hive-staging_hive_2018-12-16_20-34-14_029_6156246830953134355-1/_task_tmp.-ext-10002/event_dt=2018-12-16/_tmp.000000_0 
with stripeSize: 268435456 blockSize: 268435456 compression: ZLIB bufferSize: 262144
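
For reference, these writer parameters correspond to ORC properties that can be set per table. The mapping below is shown only as an illustration (property names per Apache ORC; check them against your Hive version):

-- Illustrative only: stripeSize -> orc.stripe.size, compression -> orc.compress,
-- bufferSize -> orc.compress.size; blockSize comes from hive.exec.orc.default.block.size
ALTER TABLE events SET TBLPROPERTIES (
  "orc.stripe.size"="268435456",
  "orc.compress"="ZLIB",
  "orc.compress.size"="262144"
);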

When a new row is added to the current stripe, the ORC Memory Manager checks after every 5,000 rows (defined by the orc.rows.between.memory.checks option, which accepts values from 1 to 10,000 rows and is not applied in Hive 2.1 and earlier) whether the ORC writer needs to flush the current stripe and start a new one.
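
If your Hive and ORC versions honor this option, you can make the check more frequent so that the writer overshoots the limit by a smaller margin between checks. A sketch at the table level (illustrative only; as noted above, it has no effect in Hive 2.1 and earlier):

-- Check the in-memory stripe size every 1,000 rows instead of the default 5,000
ALTER TABLE events SET TBLPROPERTIES ("orc.rows.between.memory.checks"="1000");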

In the DEBUG log you can see the following records:

DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 202588250 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 424133109 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: flushed 
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 121520440 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 226542060 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: Notifying writers after 5000
DEBUG [main] org.apache.orc.impl.WriterImpl: ORC writer size = 334103206 limit = 268435456
DEBUG [main] org.apache.orc.impl.MemoryManager: flushed

You can see that the first stripe was flushed after 10,000 rows and the second after 15,000 rows, which exactly matches the structure of the ORC file reported by the orcfiledump utility above.

In this particular example, limit is the stripe size (in general it is scaled down when the combined stripe sizes of all open writers exceed the ORC memory pool), while size is the current in-memory size of the data buffered for the stripe.

Why does size grow so large so quickly (remember that the entire source .gz file is just 120 MB) that the stripe has to be flushed so often?

The reason is that size is the uncompressed size of the data: in my example the payload column holds JSON, and the compression ratio for JSON can exceed 10x, so 120 MB of .gz input can expand to well over 1 GB in the writer’s buffers. Let’s now add some settings to create an ORC file with a single stripe.

I am going to change the stripe size to 1.2 GB at the table level. This is just metadata that raises the writer’s limit; a stripe of that physical size will practically never be written when loading JSON data from .gz, because the data shrinks again under compression:

ALTER TABLE events SET TBLPROPERTIES ("orc.stripe.size"="1268435456");
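
A similar default can also be set at the session level, which typically applies to ORC tables that do not specify orc.stripe.size in their TBLPROPERTIES (shown only as an alternative, not what is used in this example):

-- Session-level default stripe size for newly written ORC files
set hive.exec.orc.default.stripe.size = 1268435456;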

The stripe size now exceeds the default map task heap of 1 GB, and the heap would limit the stripe buffer size, so let’s increase the heap as well as the fraction of it that can be used for the ORC stripe buffers (with -Xmx1638m and a pool fraction of 0.8, roughly 1.3 GB of heap becomes available to the ORC writer, enough to hold the 1.2 GB stripe):

set mapreduce.map.memory.mb = 2048;
set mapreduce.map.java.opts = -Xmx1638m;

-- Allow stripe buffer to take more memory from heap
set hive.exec.orc.memory.pool = 0.8;

Now if you rerun the INSERT statement and look at the ORC file content, you can see that only 1 stripe was created:

Stripe: offset: 3 data: 161226974 rows: 25351 tail: 445 index: 27384

Hive 2.1, Amazon EMR and S3 Specific Issues

In our example, we converted a relatively small 120 MB .gz file into ORC. If you repeat the same steps for a 4 GB .gz file, you can still see stripes like the following:

Stripe: ... rows: 15000 
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000
Stripe: ... rows: 5000
Stripe: ... rows: 25000

File length: 7868020056 bytes
Padding length: 1691786646 bytes
Padding ratio: 21.5%

You can see that large (15,000 and 25,000 rows) and small (5,000 rows) stripes still alternate, and significant padding of 1.6 GB was added to the ORC file.

The problem is that when the ORC writer flushes the current stripe, it checks how much space is left in the current block (the block size is defined by hive.exec.orc.default.block.size, 256 MB by default).

If the ratio of the available block space to the stripe size is higher than the block padding tolerance (defined by hive.exec.orc.block.padding.tolerance, 0.05 by default), then the size of one or several subsequent stripes is reduced to fill the block. For example, if 96 MB remain in the block and the stripe size is 256 MB, the ratio 96/256 = 0.375 exceeds 0.05, so the next stripe is shrunk to fit into the remaining 96 MB. That is why small stripes with the minimal 5,000 rows were created.

If the ratio is lower than the block padding tolerance, then ORC just writes dummy bytes after the stripe to pad out the rest of the block (for example, with only 10 MB left and a 256 MB stripe size, 10/256 ≈ 0.04 falls below 0.05, so those 10 MB are padded).

Since we do not want small stripes, and there is no value in padding blocks for S3 object storage, we can use the following options to switch both behaviors off at the table level:

ALTER TABLE events SET TBLPROPERTIES ("orc.block.padding"="false", "orc.block.padding.tolerance"="1.0");
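
To confirm that the new properties are attached to the table before reloading the partition, you can list them (a quick sanity check; the output will vary by environment):

SHOW TBLPROPERTIES events;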