
Storage Tuning for Mapped JSON Conversion to ORC File Format – Java Heap Issues with Dictionary Encoding

Usually in a Data Lake we get source data as compressed JSON payloads (.gz files). Additionally, the first level of JSON objects is often parsed into a map<string, string> structure to speed up access to the first-level keys/values, and the get_json_object function can then be used to parse the deeper JSON levels whenever required.
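For example, a query on the raw table might look like this (the map keys 'event_id' and 'device' and the JSON path '$.model' are hypothetical, just to illustrate the access pattern):

SELECT
  payload['event_id'],                             -- direct lookup of a first-level key
  get_json_object(payload['device'], '$.model')    -- parse a deeper JSON level on demand
FROM events_raw;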

But it still makes sense to convert the data into the ORC format to evenly distribute processing across smaller chunks, and to use indexes and optimize query execution for the accompanying columns such as event names, geo information, and other system attributes.

In this example we will load the source data, stored in a single 2.5 GB .gz file, into the following ORC table:

CREATE TABLE events
(
  event_timestamp string,
  event_name      string,
  app_name        string,
  app_version     string,
  platform        string,
  city            string,
  country         string,
  payload         map<string, string>   -- Mapped JSON data
)
PARTITIONED BY (event_dt string)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/events.db/events'
TBLPROPERTIES (
  "orc.stripe.size"="268435456",            -- 256 MB stripe size
  "orc.block.padding"="false",
  "orc.block.padding.tolerance"="1.0"
);

INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-26')
SELECT
  event_timestamp,
  event_name,
  app_name,
  app_version,
  platform,
  city,
  country,
  payload
FROM events_raw;

You can see that a 3.2 GB ORC file was created from the 2.5 GB .gz file:

$ aws s3 ls s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/

2018-12-26 17:11:38 3197710281 000000_0

Let’s examine its structure:

$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/000000_0

File Version: 0.12 with HIVE_13083
Rows: 886660
Compression: ZLIB
...
Stripe Statistics:
  Stripe 1:
  ...
  Stripe 60:
Stripes:
  Stripe: offset: 3 data: 54383999 rows: 15000 tail: 523 index: 2896
  Stripe: offset: 54387421 data: 52976345 rows: 15000 tail: 526 index: 2867
  Stripe: offset: 107367159 data: 52965312 rows: 15000 tail: 526 index: 3028
  ...
  Stripe: offset: 3135846190 data: 55464332 rows: 15000 tail: 520 index: 2884
  Stripe: offset: 3191313926 data: 6373925 rows: 1660 tail: 499 index: 1858

File length: 3197710281 bytes
Padding length: 0 bytes
Padding ratio: 0%

Note that although the table DDL specifies 256 MB stripes, 60 stripes were created with only 54 MB each on average (6.3 MB for the last tail stripe); the ORC writer flushes a stripe early when its in-memory buffers, including the dictionaries, approach the writer's memory limit. From the ORC file dump you can also see that DICTIONARY_V2 encoding was used (I skipped that part of the orcfiledump output).

Now let’s query the data from this table using 1 GB Tez containers (with ~0.8 GB of Java heap):

set mapreduce.input.fileinputformat.split.minsize = 16777216;    -- 16 MB
set tez.grouping.max-size = 52428800;                            -- 52 MB
set hive.tez.container.size = 1024;                              -- 1 GB Tez containers

SELECT platform, COUNT(DISTINCT app_name)
FROM events
WHERE event_dt = '2018-12-26'
GROUP BY platform;

The query fails with a Java heap space error:

Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.orc.impl.DynamicByteArray.get(DynamicByteArray.java:283)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1523)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1213)
at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:1987)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1739)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1054)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)

In fact, I had to use 2 GB Tez containers to run this query successfully against the events table with the map column and dictionary encoding in ORC.
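For reference, that workaround is just a session setting; the heap fraction shown below is the Tez default, so treat the exact numbers as an assumption that may differ in your distribution:

set hive.tez.container.size = 2048;               -- 2 GB containers
set tez.container.max.java.heap.fraction = 0.8;   -- ~1.6 GB of Java heap (Tez default)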

Now let’s turn off the dictionary encoding and reload the data:

ALTER TABLE events SET TBLPROPERTIES ("orc.dictionary.key.threshold"="0");
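
The table property only affects files written after the change, so the partition has to be rewritten, for example by rerunning the same INSERT OVERWRITE as before:

INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-26')
SELECT
  event_timestamp,
  event_name,
  app_name,
  app_version,
  platform,
  city,
  country,
  payload
FROM events_raw;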

The ORC file is still 3.2 GB, but its structure is significantly different now:

$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/000000_0

File Version: 0.12 with HIVE_13083
Rows: 886660
Compression: ZLIB
...
Stripe Statistics:
  Stripe 1:
  ...
  Stripe 14:
Stripes:
  Stripe: offset: 3 data: 36710800 rows: 10000 tail: 418 index: 1883
  Stripe: offset: 36713104 data: 255966151 rows: 70000 tail: 465 index: 6414
  Stripe: offset: 292686134 data: 256975294 rows: 70000 tail: 462 index: 6342
  ...
  Stripe: offset: 2854597332 data: 253161067 rows: 70000 tail: 464 index: 6053
  Stripe: offset: 3107764916 data: 136621419 rows: 36660 tail: 452 index: 4309

File length: 3244397986 bytes
Padding length: 0 bytes
Padding ratio: 0%

Now there are only 14 stripes, and most of them are about 255 MB. If you rerun the query with the 1 GB Tez container size, it now succeeds even though each map task processes much larger stripes (255 MB vs. 55 MB, since ORC splits are aligned to stripe boundaries and a task reads at least one full stripe).

So by turning off dictionary encoding for a table with mapped JSON objects, you can get much larger ORC stripes and avoid Java heap errors even with small 1 GB Tez containers.
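
For new tables the same setting can be included in the DDL from the start; here is a sketch based on the table definition above:

CREATE TABLE events
(
  event_timestamp string,
  event_name      string,
  app_name        string,
  app_version     string,
  platform        string,
  city            string,
  country         string,
  payload         map<string, string>   -- Mapped JSON data
)
PARTITIONED BY (event_dt string)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/events.db/events'
TBLPROPERTIES (
  "orc.stripe.size"="268435456",            -- 256 MB stripe size
  "orc.block.padding"="false",
  "orc.block.padding.tolerance"="1.0",
  "orc.dictionary.key.threshold"="0"        -- disable dictionary encoding
);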