Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.
Just to focus on S3 writes, I am going to use a very simple SQL INSERT statement that merely moves data from one table into another without any transformations:
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT record_id, event_timestamp, event_name, app_name, country, city, payload
FROM events_raw;
The raw data set is about 500 GB stored in 1200 .gz files, so Tez started 1200 tasks:
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_91738)
Map 1: 0/1200
Map 1: 0(+1200)/1200
...
Map 1: 1200/1200
It took 451 seconds (7.5 minutes) to complete 1200 mappers, and then the next message appeared in the log only after another 18 minutes:
Loading data to table events partition (event_dt=2018-12-02, event_hour=00)
And finally the job was done after another 13.5 minutes:
OK
Time taken: 2358.849 seconds
So the entire job took about 40 minutes. Let's investigate why it took so long, and which S3 write operations were performed.
Map Task Writes
If we look at the log of each Map task, we can see the following:
2018-12-04 14:02:05 Opening 's3://cloudsqale/hive/events.db/events_raw/00919.gz' for reading
2018-12-04 14:02:06 exec.FileSinkOperator Writing to temp file: FS s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_task_tmp.-ext-10000/_tmp.000649_0
2018-12-04 14:07:39 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_task_tmp.-ext-10000/_tmp.000649_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_tmp.-ext-10000/000649_0
2018-12-04 14:07:51 Task completed
You can see that the map task first wrote its data to the temporary file _task_tmp.-ext-10000/_tmp.000649_0 in the staging directory on S3, and producing the result took just 5.5 minutes!
On completion, the map task just renamed (moved in S3) this file to _tmp.-ext-10000/000649_0, which took another 10 seconds.
You can see that after all map tasks complete, all the data is already on S3! And this took just 7.5 minutes (the duration of the longest map task). So what did Hive do for the remaining 32.5 minutes?
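It helps to remember that S3 has no native rename operation: a "rename" is a server-side copy of the object followed by a delete of the source key, so its cost grows with object size. Here is a minimal sketch of that decomposition, using a plain Python dict as a hypothetical stand-in for the bucket (the key names mimic the staging paths from the log, shortened for readability):

```python
# S3 has no rename: the filesystem layer emulates it as COPY + DELETE.
# A dict stands in for the bucket; on real S3 the copy step transfers
# the full object server-side, so it is proportional to object size.

def s3_rename(store, src, dst):
    """Emulate an S3 'rename': copy the object to the new key, delete the old."""
    store[dst] = store[src]   # COPY (server-side, size-dependent on real S3)
    del store[src]            # DELETE the source key

bucket = {
    ".hive-staging/_task_tmp.-ext-10000/_tmp.000649_0": b"row data",
}

s3_rename(
    bucket,
    ".hive-staging/_task_tmp.-ext-10000/_tmp.000649_0",
    ".hive-staging/_tmp.-ext-10000/000649_0",
)

print(sorted(bucket))  # only the renamed key remains
```

So every "move" Hive performs below is really one more full copy of the data within S3.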
Moving Temporary Directory
When all map tasks have completed, the Hive driver renames the temporary directory _tmp.-ext-10000 to -ext-10000 within the staging directory. You can see the following messages in the Hive log:
2018-12-04 14:09:09 Map 1: 1200/1200
2018-12-04 14:09:10 exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1391)) - Moving tmp dir: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_tmp.-ext-10000 to: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000
2018-12-04 14:27:05
So this move from one temporary location into another within the staging directory took 18 minutes.
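The per-file cost is easy to estimate from the log timestamps: 1200 objects were renamed in roughly 18 minutes, one at a time.

```python
# Rough per-file cost of the staging-directory move observed in the log:
# 1200 objects renamed sequentially in ~18 minutes.
files = 1200
move_seconds = 18 * 60
per_file = move_seconds / files
print(per_file)  # → 0.9 seconds per S3 copy+delete
```

Under a second per object sounds cheap, but multiplied by 1200 files and done sequentially it dominates the job.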
Loading Data to Table
The final step is loading the data to the table. Again you can see the following messages in the Hive log:
2018-12-04 14:27:05 Loading data to table events partition (event_dt=2018-12-02, event_hour=00) from s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000
2018-12-04 14:27:08 Creating directory if it doesn't exist: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00
2018-12-04 14:27:08 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000/000000_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/000000_0
...
2018-12-04 14:39:20 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000/001199_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/001199_0
2018-12-04 14:39:39
You can see that Hive does the final rename (move in S3) of 1200 files sequentially! This took another 12 minutes.
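Since these 1200 renames are independent of each other, they could in principle be issued concurrently instead of one by one. Here is a sketch of that idea with a thread pool; the dict-backed store and fake_rename helper are hypothetical stand-ins for the real S3 copy+delete calls, not Hive's actual implementation:

```python
# Hypothetical sketch: issuing the 1200 final renames concurrently
# instead of sequentially. A dict stands in for the S3 bucket.
from concurrent.futures import ThreadPoolExecutor
import threading

store = {f"staging/-ext-10000/{i:06d}_0": b"data" for i in range(1200)}
lock = threading.Lock()

def fake_rename(i):
    src = f"staging/-ext-10000/{i:06d}_0"
    dst = f"final/{i:06d}_0"
    with lock:                       # dict mutation needs a lock here
        store[dst] = store.pop(src)  # copy+delete, as on real S3

with ThreadPoolExecutor(max_workers=32) as pool:
    # all renames are submitted at once; the pool drains them in parallel
    list(pool.map(fake_rename, range(1200)))

print(sum(k.startswith("final/") for k in store))  # → 1200
```

With the observed ~0.9 s per rename, even modest concurrency would cut this phase from minutes to seconds; in the Hive version tested here, however, the moves run strictly sequentially.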
This simple example shows that Amazon EMR Hive performs the following S3 write operations for a map-only job:
- Map Write – Each map task writes its data to a _task_tmp.-ext-10000/_tmp.XXXXX_0 file in the staging directory on S3. This took 7.5 minutes, and it is actually the data we need in the final table!
- Moving Temporary Directory – The Hive driver moves all map output files from the _tmp.-ext-10000 directory to the -ext-10000 directory within the staging directory on S3. This is just a copy within S3, and it took 18 minutes.
- Loading Data to Table – The Hive driver moves all map output files to their final destination sequentially. This is one more copy within S3, which took an additional 12 minutes.
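Putting the phases together makes the imbalance explicit: only the first phase produces new data, while the other two just shuffle the same bytes around within S3.

```python
# Breakdown of the job's wall-clock time by phase (minutes, from the logs).
phases = {"map write": 7.5, "staging-dir move": 18.0, "final load": 12.0}

copy_overhead = phases["staging-dir move"] + phases["final load"]
print(copy_overhead)                          # → 30.0 minutes of pure S3 copying
print(copy_overhead / sum(phases.values()))   # → 0.8 of the measured time
```

In other words, roughly 80% of the job was spent copying data that the mappers had already written to S3 in its final form.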
Tests were done on Amazon EMR Hive 2.1.1.