Let’s see how Apache Tez determines the number of map tasks when you execute a SQL query in Hive running on the Tez engine.
Consider the following sample SQL query on a partitioned table:
select count(*) from events where event_dt = '2018-10-20' and event_name = 'Started';
The events table is partitioned by the event_dt and event_name columns, and it is quite big. It has 209,146 partitions, while the query requests data from a single partition only:
$ hive -e "show partitions events"
event_dt=2017-09-22/event_name=ClientMetrics
event_dt=2017-09-23/event_name=ClientMetrics
...
event_dt=2018-10-20/event_name=Location
...
event_dt=2018-10-20/event_name=Started
Time taken: 26.457 seconds, Fetched: 209146 row(s)
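To make the "single partition out of 209,146" point concrete, here is a minimal Python sketch of partition pruning applied to the partition names listed above. It only illustrates the idea and is not how Hive or the metastore implements pruning. The helper names parse_partition and prune are hypothetical, and the sample list contains only the four partition names visible in the output.

```python
def parse_partition(line):
    """Turn 'event_dt=2018-10-20/event_name=Started' into a dict of key -> value."""
    return dict(part.split("=", 1) for part in line.strip().split("/"))


def prune(partition_lines, **predicates):
    """Keep only the partitions whose keys equal every given predicate value."""
    selected = []
    for line in partition_lines:
        spec = parse_partition(line)
        if all(spec.get(key) == value for key, value in predicates.items()):
            selected.append(line.strip())
    return selected


# Only the partition names shown in the listing above (the rest are elided there).
sample = [
    "event_dt=2017-09-22/event_name=ClientMetrics",
    "event_dt=2017-09-23/event_name=ClientMetrics",
    "event_dt=2018-10-20/event_name=Location",
    "event_dt=2018-10-20/event_name=Started",
]

# Equivalent of: where event_dt = '2018-10-20' and event_name = 'Started'
matching = prune(sample, event_dt="2018-10-20", event_name="Started")
print(matching)  # ['event_dt=2018-10-20/event_name=Started']
```

Because both partition columns appear in the WHERE clause with equality predicates, exactly one partition name survives the filter, so only that partition's files are relevant when the number of map tasks is computed.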