Usually in a Data Lake we get source data as compressed JSON payloads (.gz files). Additionally, the first level of JSON objects is often parsed into a map<string, string> structure to speed up access to the first-level keys and values, and then the get_json_object function can be used to parse deeper JSON levels whenever required.
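For example, a second-level attribute can be extracted from the mapped payload like this (events_raw is the raw source table used below, while the checkout and item_id keys are purely illustrative):

SELECT get_json_object(payload['checkout'], '$.item_id') AS item_id
FROM events_raw;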
Even with this mapped JSON approach, it still makes sense to convert the data into the ORC format: processing can be distributed evenly across smaller chunks, and indexes can be used to optimize query execution on the complementary columns such as event name, geo information and other system attributes.
In this example we will load the source data, stored in a single 2.5 GB .gz file, into the following ORC table:
CREATE TABLE events (
  event_timestamp string,
  event_name string,
  app_name string,
  app_version string,
  platform string,
  city string,
  country string,
  payload map<string, string>   -- Mapped JSON data
)
PARTITIONED BY (event_dt string)
STORED AS ORC
LOCATION 's3://cloudsqale/hive/events.db/events'
TBLPROPERTIES (
  "orc.stripe.size"="268435456",          -- 256 MB stripe size
  "orc.block.padding"="false",
  "orc.block.padding.tolerance"="1.0"
);

INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-26')
SELECT
  event_timestamp,
  event_name,
  app_name,
  app_version,
  platform,
  city,
  country,
  payload
FROM events_raw;
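The events_raw source table is not shown here; a minimal sketch of one possible shape, assuming the first JSON level has already been materialized into a Hive map in delimited text files (the delimiters and location are just assumptions):

CREATE EXTERNAL TABLE events_raw (
  event_timestamp string,
  event_name string,
  app_name string,
  app_version string,
  platform string,
  city string,
  country string,
  payload map<string, string>   -- first JSON level already parsed into a map upstream
)
PARTITIONED BY (event_dt string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE              -- the compressed .gz text files are read as-is
LOCATION 's3://cloudsqale/hive/events.db/events_raw';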
You can see that a 3.2 GB ORC file was created from the 2.5 GB .gz file:
$ aws s3 ls s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/
2018-12-26 17:11:38 3197710281 000000_0
Let’s examine its structure:
$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/000000_0

File Version: 0.12 with HIVE_13083
Rows: 886660
Compression: ZLIB
...
Stripe Statistics:
  Stripe 1:
  ...
  Stripe 60:

Stripes:
  Stripe: offset: 3 data: 54383999 rows: 15000 tail: 523 index: 2896
  Stripe: offset: 54387421 data: 52976345 rows: 15000 tail: 526 index: 2867
  Stripe: offset: 107367159 data: 52965312 rows: 15000 tail: 526 index: 3028
  ...
  Stripe: offset: 3135846190 data: 55464332 rows: 15000 tail: 520 index: 2884
  Stripe: offset: 3191313926 data: 6373925 rows: 1660 tail: 499 index: 1858

File length: 3197710281 bytes
Padding length: 0 bytes
Padding ratio: 0%
Note that although the table DDL specifies 256 MB stripes, 60 stripes were created with 54 MB on average (6.3 MB for the last tail stripe), most likely because dictionary encoding significantly increases the ORC writer's memory footprint per stripe, so stripes are flushed well before they reach the configured size. From the ORC file dump you can also see that DICTIONARY_V2 encoding was used (I skipped that part of the orcfiledump output).
Now let’s query data from this table using 1 GB Tez containers (with ~0.8 GB of Java heap):
set mapreduce.input.fileinputformat.split.minsize = 16777216;   -- 16 MB
set tez.grouping.max-size = 52428800;                           -- 52 MB
set hive.tez.container.size = 1024;

SELECT platform, COUNT(DISTINCT app_name)
FROM events
WHERE event_dt = '2018-12-26'
GROUP BY platform;
The query fails with a Java heap space error:
Caused by: java.lang.OutOfMemoryError: Java heap space
  at org.apache.orc.impl.DynamicByteArray.get(DynamicByteArray.java:283)
  at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1523)
  at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1213)
  at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:1987)
  at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1739)
  at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1054)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)
In fact, I need to use 2 GB Tez containers to run this query successfully against the events table when its map column is stored with dictionary encoding in ORC.
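For reference, the successful run only needed a larger container; something along these lines (the Java heap is sized automatically as a fraction of the container, roughly 0.8 by default, so 2 GB gives about 1.6 GB of heap):

set hive.tez.container.size = 2048;   -- 2 GB container, ~1.6 GB Java heap

SELECT platform, COUNT(DISTINCT app_name)
FROM events
WHERE event_dt = '2018-12-26'
GROUP BY platform;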
Now let’s turn off the dictionary encoding and reload the data:
ALTER TABLE events SET TBLPROPERTIES ("orc.dictionary.key.threshold"="0");
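The reload itself is just the same INSERT OVERWRITE as before, re-executed after the property change so the partition is rewritten without dictionaries:

INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-26')
SELECT event_timestamp, event_name, app_name, app_version,
       platform, city, country, payload
FROM events_raw;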
The ORC file is still 3.2 GB, but its structure is significantly different now:
$ hive --orcfiledump s3://cloudsqale/hive/events.db/events/event_dt=2018-12-26/000000_0

File Version: 0.12 with HIVE_13083
Rows: 886660
Compression: ZLIB
...
Stripe Statistics:
  Stripe 1:
  ...
  Stripe 14:

Stripes:
  Stripe: offset: 3 data: 36710800 rows: 10000 tail: 418 index: 1883
  Stripe: offset: 36713104 data: 255966151 rows: 70000 tail: 465 index: 6414
  Stripe: offset: 292686134 data: 256975294 rows: 70000 tail: 462 index: 6342
  ...
  Stripe: offset: 2854597332 data: 253161067 rows: 70000 tail: 464 index: 6053
  Stripe: offset: 3107764916 data: 136621419 rows: 36660 tail: 452 index: 4309

File length: 3244397986 bytes
Padding length: 0 bytes
Padding ratio: 0%
Now there are only 14 stripes, and most of them are about 255 MB. If you rerun the query with the 1 GB Tez container size, it now succeeds even though each map task processes much larger stripes (255 MB vs 55 MB).
So by turning off dictionary encoding for a table with mapped JSON objects you can get much larger ORC stripes and avoid Java heap errors even with small 1 GB Tez containers.
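If a table is known up front to hold high-cardinality mapped JSON data, the same property can of course be set at creation time instead of altering the table later; a minimal sketch (the events_no_dict name is just for illustration):

CREATE TABLE events_no_dict (
  event_timestamp string,
  event_name string,
  app_name string,
  app_version string,
  platform string,
  city string,
  country string,
  payload map<string, string>
)
PARTITIONED BY (event_dt string)
STORED AS ORC
TBLPROPERTIES (
  "orc.stripe.size"="268435456",       -- 256 MB stripe size
  "orc.dictionary.key.threshold"="0"   -- disable dictionary encoding
);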