I was asked to tune a Hive query that ran more than 10 hours. It was running on a 100 node cluster with 16 GB available for YARN containers on each node.
Although the query processed about 2 TB of input data, it did a fairly simple aggregation on
user_id column and did not look too complex. It had 1 Map stage with 1,500 tasks and 1 Reduce stage with 7,000 tasks.
All map tasks completed within 30 minutes, and the query stuck on the Reduce phase. So what was wrong?