
Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR

Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files with small stripes.

Note. All experiments below were executed on Hive 2.1.1 in Amazon EMR. This article does not apply to Qubole running on AWS; Qubole uses a different algorithm to define the number of map tasks for ORC files.

Consider the data for a single partition of the clicks table:

$ aws s3 ls s3://cloudsqale/hive/clicks.db/clicks/event_dt=2018-11-08/
  2018-11-09 06:57:56  897099150 000054_0

There is only one ORC file of 897,099,150 bytes (855 MB). Let’s look at the metadata of this ORC file:

$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/clicks/event_dt=2018-11-08/000054_0

Rows: 52541307
Compression: ZLIB
Compression size: 131072
...
Stripe Statistics:
  Stripe 1 ... Stripe 402
...
Stripes:
  Stripe: offset: 3 data: 7144608 rows: 425000 tail: 634 index: 20539
  ...
  Stripe: offset: 530610783 data: 91139 rows: 5000 tail: 538 index: 1292
  ...
  Stripe: offset: 891009703 data: 5985554 rows: 356307 tail: 631 index: 17484 

This ORC file has a very large number of stripes (402, about 2 MB per stripe on average, while a stripe should normally be 64-256 MB). Let’s not pay attention to that right now; the file was probably produced by an ETL process that merged small ORC files at the stripe level (Hive’s ALTER TABLE … CONCATENATE works this way, for example). You can also notice that some stripes have 425K rows, while others have only 5K rows.

Let’s now run some queries in Hive and check how many mappers are used:

set hive.exec.orc.split.strategy=BI; 

select product, count(*) 
from clicks
where event_dt='2018-11-08' 
group by product;

You can see that 13 map tasks were launched:

Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89828)
Map 1: 0/13     Reducer 2: 0/157

When hive.exec.orc.split.strategy=BI is set, Hive does not read the ORC stripe information to define the number of input splits. Instead, it uses the following algorithm:

  • Define the input size – 897,099,150 bytes
  • Define the block size – fs.s3n.block.size=67108864 (depending on the S3 URI scheme it can be fs.s3.block.size or fs.s3a.block.size)
  • Get the number of input splits as input size/fs.s3n.block.size = 897099150/67108864 = 13
  • Check the input split size 897099150/13 = 69007626 bytes to make sure it falls between tez.grouping.min-size=52428800 and tez.grouping.max-size=1073741824 (the defaults). Since it does, no split grouping is required, and 13 Map tasks are created (see the sketch below).
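
The same calculation as a back-of-the-envelope sketch in Java (this mirrors the arithmetic only, not Hive’s actual code):

public class BiSplitSketch {
    public static void main(String[] args) {
        long inputSize = 897_099_150L;        // total size of the ORC file
        long blockSize = 67_108_864L;         // fs.s3n.block.size
        long groupMin  = 52_428_800L;         // tez.grouping.min-size default
        long groupMax  = 1_073_741_824L;      // tez.grouping.max-size default

        long splits = inputSize / blockSize;  // 13
        long splitSize = inputSize / splits;  // 69,007,626 bytes

        // Grouping is applied only if the split size falls outside
        // [tez.grouping.min-size, tez.grouping.max-size]
        boolean groupingNeeded = splitSize < groupMin || splitSize > groupMax;
        System.out.println(splits + " splits, grouping needed: " + groupingNeeded);
        // Prints: 13 splits, grouping needed: false
    }
}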

Now I will try to increase the number of Map tasks by reducing the S3 block size:

set hive.exec.orc.split.strategy=BI; 
set fs.s3n.block.size=5000000;  -- 5M

select product, count(*) 
from clicks
where event_dt='2018-11-08' 
group by product;

Although I set fs.s3n.block.size=5000000 (5 MB), Hive grouped the splits to satisfy tez.grouping.min-size=52428800 (52.4 MB), so I got only 20 Map tasks:

Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89834)
Map 1: 0/20     Reducer 2: 0/157
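
Here is a simplified sketch of why the 5 MB raw splits collapsed to about 20 grouped splits. The real Tez grouper (TezSplitGrouper) also accounts for waves and data locality, which is why the estimate below (17) is slightly lower than the 20 tasks I observed:

public class TezGroupingSketch {
    public static void main(String[] args) {
        long inputSize = 897_099_150L;
        long rawSplitSize = 5_000_000L;              // fs.s3n.block.size
        long groupMin = 52_428_800L;                 // tez.grouping.min-size

        long rawSplits = inputSize / rawSplitSize;   // 179 raw splits
        long lengthPerGroup = inputSize / rawSplits; // ~5 MB per split

        long groupedSplits = rawSplits;
        if (lengthPerGroup < groupMin) {
            // Raw splits are too small: regroup so that each grouped split
            // is at least tez.grouping.min-size bytes long
            groupedSplits = inputSize / groupMin;    // 17
        }
        System.out.println("grouped splits (estimate): " + groupedSplits);
    }
}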

Let’s add one more setting:

set hive.exec.orc.split.strategy=BI; 
set fs.s3n.block.size=5000000;       -- 5M
set tez.grouping.min-size=5000000;   -- 5M

select product, count(*) 
from clicks
where event_dt='2018-11-08' 
group by product;

And now we get 179 Map tasks (897099150/5000000 ≈ 179, so no split grouping was applied):

Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89839)
Map 1: 0/179    Reducer 2: 0/157

Now let’s experiment with the hive.exec.orc.split.strategy=ETL option. With this option, Hive reads the ORC stripe information to define the number of input splits and Map tasks.

set hive.exec.orc.split.strategy=ETL; 

select product, count(*) 
from clicks
where event_dt='2018-11-08' 
group by product;

7 Map tasks were launched in my environment:

Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89840)
Map 1: 0/7      Reducer 2: 0/157

When hive.exec.orc.split.strategy=ETL is set, the following algorithm is applied:

  • Read the ORC stripe information – there are 402 stripes in my sample ORC file
  • Combine adjacent stripes to create input splits within mapreduce.input.fileinputformat.split.minsize=128000000 and mapreduce.input.fileinputformat.split.maxsize=256000000 (these values are set in my EMR cluster). This gives 897099150/128000000 ≈ 7 input splits (see the sketch below).
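
The stripe-combining step can be sketched as follows. This is a simplification of what OrcInputFormat does (the max split size and a few other factors are ignored here), and the point is only that a split always contains whole stripes:

public class EtlSplitSketch {
    // Pack consecutive ORC stripes into splits: a split is closed as soon as
    // the accumulated stripe size reaches the minimum split size
    static int countSplits(long[] stripeSizes, long minSplitSize) {
        int splits = 0;
        long current = 0;
        for (long size : stripeSizes) {
            current += size;                 // stripes are never broken apart
            if (current >= minSplitSize) {
                splits++;                    // close the split at a stripe boundary
                current = 0;
            }
        }
        return current > 0 ? splits + 1 : splits;  // last, possibly short, split
    }

    public static void main(String[] args) {
        // The first three stripe sizes are taken from the orcfiledump output
        // above; the last two are hypothetical (the real file has 402 stripes)
        long[] stripes = {7_144_608L, 91_139L, 5_985_554L, 7_000_000L, 2_000_000L};
        System.out.println(countSplits(stripes, 5_000_000L));  // prints 4
    }
}

Because splits are cut at stripe boundaries, lowering the split size below the average stripe size cannot increase the number of splits proportionally, which is exactly what the next experiment shows.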

Now let’s try to increase the number of Map tasks:

set hive.exec.orc.split.strategy=ETL; 
set mapreduce.input.fileinputformat.split.maxsize=5000000; -- 5M
set tez.grouping.min-size=50000000; -- 50M, ignored!

select product, count(*) 
from clicks
where event_dt='2018-11-08' 
group by product;

Now we get 127 Map tasks (fewer than 897099150/5000000 ≈ 179 because splits are cut at ORC stripe boundaries):

Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89862)
Map 1: 0/127    Reducer 2: 0/157

Note that in this case Hive ignored the tez.grouping.min-size option and did not perform split grouping.