Hive, ORC, Tez

ORC Files Split Computation – Hive on Tez

In the previous article I already wrote about split generation (see Tez Internals #2 – Number of Map Tasks for Large ORC Files), and here I would like to share some more details.

I have 143 GB of daily clicks data stored in 33 ORC files:

$ aws s3 ls s3://cloudsqale/hive/clicks.db/event_dt=2018-12-20/ --summarize

2018-12-21 14:10:06 9475556240 part-m-00029
2018-12-21 14:09:50 9394567975 part-m-00030
2018-12-21 14:10:04 9321314411 part-m-00031
...
2018-12-21 14:17:19 1035852738 part-m-00959

Total Objects: 33
   Total Size: 143771577882

Let’s take a single file and examine its structure:

$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/event_dt=2018-12-20/part-m-00029

...
Rows: 1296757
Compression: ZLIB
...
Stripe 1:
...
Stripe 66:

Stripes:
  Stripe: offset: 3 data: 69219899 rows: 10000 tail: 530 index: 3156
  Stripe: offset: 69223588 data: 146348802 rows: 20000 tail: 542 index: 43990
  Stripe: offset: 215616922 data: 149172427 rows: 20000 tail: 544 index: 24583
  ...
  Stripe: offset: 9429844027 data: 50141419 rows: 6757 tail: 529 index: 22221

File length: 9475556240 bytes
Padding length: 0 bytes
Padding ratio: 0%

You can see that the file has 66 stripes ranging from 70 MB to 150 MB (with a 50 MB trailing stripe at the end of the file).
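
If you only need the stripe sizes, you can also pull them straight out of the dump; here is a quick sketch assuming the orcfiledump output format shown above (field positions may differ between Hive/ORC versions):

$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/event_dt=2018-12-20/part-m-00029 |
    grep 'Stripe: offset:' | awk '{print $5}'    # "data" size in bytes for every stripe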

Now let’s run a sample SQL query and see how Hive will calculate input splits for this data set:

SELECT page_type, COUNT(DISTINCT url)
FROM clicks
WHERE event_dt = '2018-12-20'
GROUP BY page_type

But before we execute this query, let’s check the options set for the cluster:

set mapreduce.input.fileinputformat.split.minsize = 16777216;    -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize = 268435456;   -- 268 MB

set tez.grouping.min-size = 52428800;     -- 52 MB
set tez.grouping.max-size = 1073741824;   -- 1 GB
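
If you are not sure what is in effect on your cluster, you can print the current value of any of these properties from the Hive CLI, for example:

set tez.grouping.max-size;
-- prints tez.grouping.max-size=<current value>, or reports that it is undefined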

For this query, Tez launched 907 Map tasks:

Map 1: 907/907      Reducer 2: 134/134
OK

If you look at Tez UI, you can see the following split sizes:

[Tez UI screenshot: input split sizes for the map tasks]

So the largest map tasks process splits that include two stripes (typically 20,000 + 20,000 rows), while the smallest tasks process a single stripe only (20,000 rows).

Why were the ORC stripes grouped this way? There are two steps to compute and group splits when Tez processes ORC files:

1. Split Computation and Grouping by OrcInputFormat

First, OrcInputFormat reads the stripe information from all input ORC files and creates one split per stripe, unless a stripe is smaller than the mapreduce.input.fileinputformat.split.minsize value.

If a stripe is smaller than mapreduce.input.fileinputformat.split.minsize, OrcInputFormat combines multiple stripes into a single input split.
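
The logic of this first step can be sketched roughly as follows; this is a simplified illustration of the description above in Python, not the actual OrcInputFormat code (which also depends on the split strategy and other settings):

# One split per stripe, except that stripes smaller than split.minsize
# are merged with the following stripes of the same file.

MIN_SPLIT_SIZE = 16777216   # mapreduce.input.fileinputformat.split.minsize

def stripes_to_splits(stripe_sizes):
    """stripe_sizes: on-disk sizes of the stripes of one ORC file, in order."""
    splits, current = [], 0
    for size in stripe_sizes:
        current += size
        if current >= MIN_SPLIT_SIZE:   # large enough to become a split
            splits.append(current)
            current = 0
    if current > 0:                     # small leftover stripes at the end of the file
        splits.append(current)
    return splits

# In part-m-00029 every stripe is already larger than 16 MB,
# so each of its 66 stripes becomes its own input split.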

2. Split Grouping by Tez

In addition to OrcInputFormat, the Tez engine can further group the input splits.

Initially, Tez asks the YARN Resource Manager for the number of available containers, multiplies this number by tez.grouping.split-waves (1.7 by default; for more information about split waves, read Tez Internals #1 – Number of Map Tasks) and gets the desired number of input splits (and tasks).

From the Tez Application Master log you can see:

[INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 1084. 337 available slots, 1.7 waves.
[INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 572 

1084 is the total number of stripes in all input ORC files, 337 is the number of available containers, and 337 * 1.7 = 572.9 yields the 572 estimated tasks.
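
The estimate from the log can be reproduced in a couple of lines (the truncation below matches the logged value, but the exact rounding Tez uses is my assumption):

available_slots = 337     # containers reported by the YARN Resource Manager
split_waves = 1.7         # tez.grouping.split-waves
desired_tasks = int(available_slots * split_waves)
print(desired_tasks)      # 572, matching "Estimated number of tasks: 572"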

Tez knows that it wants to run 572 tasks, so it defines the average data size per task:

[INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez
[INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits: 572 lengthPerGroup: 822989185 numLocations: 1 numSplitsPerLocation: 1084 numSplitsInGroup: 1 totalLength: 470749813876 numOriginalSplits: 1084 . Grouping by length: true count: false nodeLocalOnly: false

Tez calculates totalLength, the total volume of input data to process. Note that Tez uses the column statistics from the ORC files, not from the Hive Metastore (!), to get the estimated uncompressed data size. That’s why totalLength is 470 GB while the total size of the input ORC files is just 143 GB.

Knowing the total data size and the desired number of tasks, Tez gets lengthPerGroup, the desired size of an input split: 470749813876 / 572 = 822989185. So the desired input split is about 823 MB (!), and again this is the uncompressed data size.

Similar to OrcInputFormat, Tez also goes through all ORC stripes (more precisely, through the input splits created by OrcInputFormat), but now it deals with the uncompressed data sizes.

If an input split is smaller than tez.grouping.min-size, it is combined with other splits in an attempt to create input splits of roughly the lengthPerGroup size.

It is hard to hit this size exactly: since Tez has to operate on whole ORC stripes, it cannot break a single stripe into multiple input splits:

[INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Number of splits desired: 572 created: 907 splitsProcessed: 1084

That’s why 907 tasks were created: because of the large ORC stripe sizes, Tez could not combine the 1084 stripes into 572 input splits; it could only get them down to 907 input splits, i.e. tasks.
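
This second step can also be sketched in a few lines: splits are accumulated into groups of roughly lengthPerGroup bytes, but since a stripe can never be cut in two, many groups have to be closed early and the result overshoots the desired count. Again, this is a simplified illustration, not the actual TezSplitGrouper code:

LENGTH_PER_GROUP = 822989185   # desired uncompressed bytes per task (see above)

def group_splits(split_lengths):
    """split_lengths: uncompressed sizes of the splits produced in step 1."""
    groups, current = [], 0
    for length in split_lengths:
        if current and current + length > LENGTH_PER_GROUP:
            groups.append(current)   # close the group before it grows too large
            current = 0
        current += length            # a whole stripe always lands in some group
    if current:
        groups.append(current)
    return groups

# The 1084 splits carry about 434 MB of uncompressed data each on average
# (470749813876 / 1084), so most groups can hold only one or two of them,
# and we end up with 907 groups instead of the desired 572.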

Maximum Number of Tasks

With the current version of the ORC file format, you cannot create more map tasks than there are stripes in the input ORC files. But how can we launch 1084 tasks instead of 907?

You can just set tez.grouping.max-size to a value lower than the smallest stripe size, and Tez will not combine stripes anymore:

set tez.grouping.max-size = 52428800;    -- 52 MB

-- Restart the query
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1514927654320_94119)

Map 1: 0/1084   Reducer 2: 0/134

So, as expected, Tez now launched 1084 tasks, equal to the total number of stripes in all the input ORC files.

Tests were performed on Amazon EMR, Hive 2.1.1.