Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files that have small stripes.
Note. All experiments below were executed on Hive 2.1.1 on Amazon EMR. This article does not apply to Qubole running on AWS: Qubole uses a different algorithm to define the number of map tasks for ORC files.
Consider data for a single partition of the clicks table:
$ aws s3 ls s3://cloudsqale/hive/clicks.db/clicks/event_dt=2018-11-08/

2018-11-09 06:57:56  897099150 000054_0
There is only one ORC file, 897,099,150 bytes (855 MB) in size. Let’s look at the metadata of this ORC file:
$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/clicks/event_dt=2018-11-08/000054_0

Rows: 52541307
Compression: ZLIB
Compression size: 131072
...
Stripe Statistics:
  Stripe 1
  ...
  Stripe 402
  ...
Stripes:
  Stripe: offset: 3 data: 7144608 rows: 425000 tail: 634 index: 20539
  ...
  Stripe: offset: 530610783 data: 91139 rows: 5000 tail: 538 index: 1292
  ...
  Stripe: offset: 891009703 data: 5985554 rows: 356307 tail: 631 index: 17484
This ORC file has a very large number of stripes – 402, or ~2 MB per stripe on average, while a stripe should normally be 64-256 MB. Let’s not pay attention to this right now; the file was probably generated by merging small ORC files at the stripe level by some ETL process. You can also notice that some stripes have 425K rows, while others have only 5K.
Let’s now run some queries in Hive and check how many mappers are used:
set hive.exec.orc.split.strategy=BI;

select product, count(*) from clicks where event_dt='2018-11-08' group by product;
You can see that 13 map tasks were launched:
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89828)

Map 1: 0/13    Reducer 2: 0/157
When hive.exec.orc.split.strategy=BI is set, Hive does not read ORC stripe information to define the number of input splits. Instead it uses the following algorithm:
- Define the input size: 897,099,150 bytes
- Define the block size: fs.s3n.block.size=67108864 (depending on the S3 URI scheme it can be fs.s3.block.size or fs.s3a.block.size)
- Get the number of input splits as input size / fs.s3n.block.size = 897099150 / 67108864 = 13
- Check that the input split size 897099150 / 13 = 69007626 is within tez.grouping.min-size=52428800 and tez.grouping.max-size=1073741824 (the defaults). Since it is, split grouping is not required, and 13 Map tasks are created.
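The steps above can be sketched in a few lines of Python. This is a hypothetical model of the arithmetic described in this article, not Hive’s actual implementation:

```python
# Hypothetical sketch of the BI split-strategy arithmetic described
# above; not Hive's actual code.

def bi_splits(input_size, block_size, grouping_min, grouping_max):
    # One input split per file-system block (integer division)
    splits = max(1, input_size // block_size)
    split_size = input_size // splits
    # Tez regroups the splits only when the split size falls outside
    # [tez.grouping.min-size, tez.grouping.max-size]
    grouping_needed = not (grouping_min <= split_size <= grouping_max)
    return splits, split_size, grouping_needed

splits, split_size, grouped = bi_splits(
    897_099_150,     # input size in bytes
    67_108_864,      # fs.s3n.block.size
    52_428_800,      # tez.grouping.min-size
    1_073_741_824)   # tez.grouping.max-size
print(splits, split_size, grouped)   # 13 69007626 False
```

With the default settings the 69 MB split size falls inside the grouping range, so the 13 raw splits map directly to 13 Map tasks.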
Now I will try to increase the number of Map tasks by reducing the S3 block size:
set hive.exec.orc.split.strategy=BI;
set fs.s3n.block.size=5000000;  -- 5M

select product, count(*) from clicks where event_dt='2018-11-08' group by product;
Although I set fs.s3n.block.size=5000000 (5 MB), Hive grouped the splits to satisfy tez.grouping.min-size=52428800 (52.4 MB), so I got only 20 Map tasks:
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89834)

Map 1: 0/20    Reducer 2: 0/157
Let’s add one more setting:
set hive.exec.orc.split.strategy=BI;
set fs.s3n.block.size=5000000;      -- 5M
set tez.grouping.min-size=5000000;  -- 5M

select product, count(*) from clicks where event_dt='2018-11-08' group by product;
And now we get 179 Map tasks:
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89839)

Map 1: 0/179    Reducer 2: 0/157
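A quick check of the arithmetic behind the 179 tasks, following the BI algorithm described above (a hypothetical sketch, not Hive’s actual code):

```python
# With both the S3 block size and tez.grouping.min-size lowered to 5 MB,
# the file yields one ~5 MB split per block and no grouping is needed.
input_size = 897_099_150
block_size = 5_000_000     # fs.s3n.block.size
grouping_min = 5_000_000   # tez.grouping.min-size

splits = input_size // block_size   # 179 raw splits
split_size = input_size // splits   # ~5 MB each
print(splits, split_size >= grouping_min)   # 179 True
```

Each split is just over the 5 MB grouping minimum, so Tez leaves all 179 splits ungrouped.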
Now let’s experiment with the hive.exec.orc.split.strategy=ETL option. With this option Hive reads ORC stripe information to define the number of input splits and Map tasks.
set hive.exec.orc.split.strategy=ETL;

select product, count(*) from clicks where event_dt='2018-11-08' group by product;
7 Map tasks were launched in my environment:
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89840)

Map 1: 0/7    Reducer 2: 0/157
When hive.exec.orc.split.strategy=ETL is set, the following algorithm is applied:
- Read ORC stripe information: there are 402 stripes in my sample ORC file
- Combine stripes to create input splits within mapreduce.input.fileinputformat.split.minsize=128000000 and mapreduce.input.fileinputformat.split.maxsize=256000000 (these values are set in my EMR cluster). This gives 897099150 / 128000000 = 7 input splits.
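The stripe-combining step can be sketched as follows. This is a simplified, hypothetical model of the behavior described above, not Hive’s actual implementation:

```python
# Simplified, hypothetical model of the ETL stripe-combining step;
# not Hive's actual code.

def etl_split_count(stripe_sizes, min_size, max_size):
    splits, current = 0, 0
    for size in stripe_sizes:
        if current and current + size > max_size:
            splits += 1      # adding this stripe would exceed split.maxsize
            current = 0
        current += size
        if current >= min_size:
            splits += 1      # split reached split.minsize, close it
            current = 0
    return splits + (1 if current else 0)

# 402 stripes of ~2.2 MB each (the average for the sample file) give
# 897099150 / 128000000 = 7 input splits:
stripes = [897_099_150 // 402] * 402
print(etl_split_count(stripes, 128_000_000, 256_000_000))   # 7
```

Since stripes are the smallest unit a split can contain, the real stripe size distribution (5K-row vs 425K-row stripes) affects where the split boundaries fall, but with these settings the total still works out to 7 splits.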
Now let’s try to increase the number of Map tasks:
set hive.exec.orc.split.strategy=ETL;
set mapreduce.input.fileinputformat.split.maxsize=5000000;  -- 5M
set tez.grouping.min-size=50000000;                         -- 50M, ignored!

select product, count(*) from clicks where event_dt='2018-11-08' group by product;
Now we get 127 Map tasks:
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_89862)

Map 1: 0/127    Reducer 2: 0/157
Note that in this case Hive ignored the tez.grouping.min-size option and did not perform split grouping.