Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.
Just to focus on S3 writes, I am going to use a very simple SQL INSERT statement that merely moves data from one table into another without any transformations:
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT record_id, event_timestamp, event_name, app_name, country, city, payload
FROM events_raw;
The raw data set is about 500 GB stored in 1200 .gz files, so Tez started 1200 tasks:
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1514927654320_91738)
Map 1: 0/1200
Map 1: 0(+1200)/1200
...
Map 1: 1200/1200
It took 451 seconds (7.5 minutes) to complete 1200 mappers, and then the next message appeared in the log only after another 18 minutes:
Loading data to table events partition (event_dt=2018-12-02, event_hour=00)
And finally the job was done after another 13.5 minutes:
OK
Time taken: 2358.849 seconds
So the entire job took about 40 minutes. Let's investigate why it took so long, and which S3 write operations were performed.
Map Task Writes
If we look at the log of each Map task, we can see the following:
2018-12-04 14:02:05 Opening 's3://cloudsqale/hive/events.db/events_raw/00919.gz' for reading
2018-12-04 14:02:06 exec.FileSinkOperator Writing to temp file: FS s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_task_tmp.-ext-10000/_tmp.000649_0
2018-12-04 14:07:39 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_task_tmp.-ext-10000/_tmp.000649_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_tmp.-ext-10000/000649_0
2018-12-04 14:07:51 Task completed
You can see that the map task first wrote its data to the temporary file _task_tmp.-ext-10000/_tmp.000649_0 in the staging directory on S3, and producing the result took just 5.5 minutes!
On completion, the map task just renamed (moved in S3) this file to _tmp.-ext-10000/000649_0, which took another 10 seconds.
You can see that after all map tasks complete, all the data is already on S3! And this took just 7.5 minutes (the duration of the longest map task). So what did Hive do for the remaining 32.5 minutes?
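It helps to remember that S3 has no native rename operation: a "rename" is a server-side copy of the object followed by a delete of the source key, so its cost grows with object size. Here is a minimal sketch of that decomposition, using a plain Python dict as a hypothetical stand-in for the bucket (the key names mimic the staging paths from the log, shortened for readability):

```python
# S3 has no rename: the filesystem layer emulates it as COPY + DELETE.
# A dict stands in for the bucket; on real S3 the copy step transfers
# the full object server-side, so it is proportional to object size.

def s3_rename(store, src, dst):
    """Emulate an S3 'rename': copy the object to the new key, delete the old."""
    store[dst] = store[src]   # COPY (server-side, size-dependent on real S3)
    del store[src]            # DELETE the source key

bucket = {
    ".hive-staging/_task_tmp.-ext-10000/_tmp.000649_0": b"row data",
}

s3_rename(
    bucket,
    ".hive-staging/_task_tmp.-ext-10000/_tmp.000649_0",
    ".hive-staging/_tmp.-ext-10000/000649_0",
)

print(sorted(bucket))  # only the renamed key remains
```

So every "move" Hive performs below is really one more full copy of the data within S3.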
Moving Temporary Directory
When all map tasks have completed, the Hive driver renames the temporary directory _tmp.-ext-10000 to -ext-10000 within the staging directory. You can see the following messages in the Hive log:
2018-12-04 14:09:09 Map 1: 1200/1200
2018-12-04 14:09:10 exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1391)) - Moving tmp dir: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/_tmp.-ext-10000 to: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000
2018-12-04 14:27:05
So this move from one temporary location into another within the staging directory took 18 minutes.
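The per-file cost is easy to estimate from the log timestamps: 1200 objects were renamed in roughly 18 minutes, one at a time.

```python
# Rough per-file cost of the staging-directory move observed in the log:
# 1200 objects renamed sequentially in ~18 minutes.
files = 1200
move_seconds = 18 * 60
per_file = move_seconds / files
print(per_file)  # → 0.9 seconds per S3 copy+delete
```

Under a second per object sounds cheap, but multiplied by 1200 files and done sequentially it dominates the job.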
Loading Data to Table
The final step is loading the data to the table. Again you can see the following messages in the Hive log:
2018-12-04 14:27:05 Loading data to table events partition (event_dt=2018-12-02, event_hour=00) from s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000
2018-12-04 14:27:08 Creating directory if it doesn't exist: s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00
2018-12-04 14:27:08 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000/000000_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/000000_0
...
2018-12-04 14:39:20 rename
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/.hive-staging_hive_2018-12-04_14-01-28_101_2887324714059975682-1/-ext-10000/001199_0
  s3://cloudsqale/hive/events.db/events/event_dt=2018-12-02/event_hour=00/001199_0
2018-12-04 14:39:39
You can see that Hive does the final rename (move in S3) of 1200 files sequentially! This took another 12 minutes.
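Since these 1200 renames are independent of each other, they could in principle be issued concurrently instead of one by one. Here is a sketch of that idea with a thread pool; the dict-backed store and fake_rename helper are hypothetical stand-ins for the real S3 copy+delete calls, not Hive's actual implementation:

```python
# Hypothetical sketch: issuing the 1200 final renames concurrently
# instead of sequentially. A dict stands in for the S3 bucket.
from concurrent.futures import ThreadPoolExecutor
import threading

store = {f"staging/-ext-10000/{i:06d}_0": b"data" for i in range(1200)}
lock = threading.Lock()

def fake_rename(i):
    src = f"staging/-ext-10000/{i:06d}_0"
    dst = f"final/{i:06d}_0"
    with lock:                       # dict mutation needs a lock here
        store[dst] = store.pop(src)  # copy+delete, as on real S3

with ThreadPoolExecutor(max_workers=32) as pool:
    # all renames are submitted at once; the pool drains them in parallel
    list(pool.map(fake_rename, range(1200)))

print(sum(k.startswith("final/") for k in store))  # → 1200
```

With the observed ~0.9 s per rename, even modest concurrency would cut this phase from minutes to seconds; in the Hive version tested here, however, the moves run strictly sequentially.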
This simple example shows that Amazon EMR Hive performs the following S3 write operations for a map-only job:
- Map Write – Each map task writes its data to a _task_tmp.-ext-10000/_tmp.XXXXX_0 file in the staging directory on S3. This took 7.5 minutes, and it is actually the data we need in the final table!
- Moving Temporary Directory – The Hive driver moves all map output files from the _tmp.-ext-10000 directory to the -ext-10000 directory within the staging directory on S3. This is just a copy within S3, and it took 18 minutes.
- Loading Data to Table – The Hive driver moves all map output files to their final destination sequentially. This is one more copy within S3, which took an additional 12 minutes.
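Putting the phases together makes the imbalance explicit: only the first phase produces new data, while the other two just shuffle the same bytes around within S3.

```python
# Breakdown of the job's wall-clock time by phase (minutes, from the logs).
phases = {"map write": 7.5, "staging-dir move": 18.0, "final load": 12.0}

copy_overhead = phases["staging-dir move"] + phases["final load"]
print(copy_overhead)                          # → 30.0 minutes of pure S3 copying
print(copy_overhead / sum(phases.values()))   # → 0.8 of the measured time
```

In other words, roughly 80% of the job was spent copying data that the mappers had already written to S3 in its final form.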
Tests were done on Amazon EMR Hive 2.1.1.