Apache Hive on Tez – Quick On The Fly Profiling of Long Running Tasks Using Jstack Probes and Flame Graphs

I was asked to diagnose and tune a long and complex ad-hoc Hive query that spent more than 4 hours on the reduce stage. The fetch from the map tasks and the merge phase completed fairly quickly (within 10 minutes) and the reducers spent most of their time iterating the input rows and performing the aggregations defined by the query – MIN, SUM, COUNT and PERCENTILE_APPROX and others on the specific columns.

After the merge phase a Tez reducer does not output many log records to help you diagnose the performance issues and find the bottlenecks. In this article I will describe how you can profile an already running Tez task without restarting the job.

Assume the job has been already running for a long time and reducers seem to “hang”. What are they really doing?

Let’s use the jstack command-line JVM utility to profile our tasks. It can attach to the specified JVM process and print the stack traces for all its threads.

For my query, I noticed that all reducers had been running for a long time, and none of them finished yet:

Map 1: 11108/11108 Reducer 2: 0(+1000)/1000

This may mean there is no significant data skew, all reducers are equally busy. But for us this also means that we do not need to profile all reducers, it would be enough to probe one of them only. Within the same Tez vertex reducers perform similar operation but just on different chunks of data.

So let’s connect to one of the cluster nodes running a reduce task, and find out ID of the TezChild process:

$ ssh -i private_key hadoop@10.x.x.x

$ sudo jps -l

3586 aws157.logpusher.Main
5809 org.apache.tez.runtime.task.TezChild
5010 org.apache.hadoop.yarn.server.nodemanager.NodeManager
3678 aws157.instancecontroller.Main
...

When you know ID of the JVM process you can get the stack trace for all its threads as follows:

sudo -u yarn jstack 5809

Typically there are many threads running, and we are interested in the stack trace for the TezChild thread only, for example:

"TezChild" #21 daemon prio=5 os_prio=0 tid=0x00007fa2a97ea000 nid=0x38c2 runnable [0x00007fa282b69000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.trim(NumericHistogram.java:225)
        at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:161)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$
          GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.
          aggregate(GenericUDAFEvaluator.java:191)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:619)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:858)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:702)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:768)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$
          GroupIterator.next(ReduceRecordSource.java:348)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.
          pushRecord(ReduceRecordSource.java:242)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:279)
...

If we collect the stack trace multiple times we can see where the process spends most of its time.

Let’s collect the stack trace from TezChild process 3,000 times with 100 milliseconds interval. Jstack calls have some overhead so we will collect stacks for about 20 minutes. It is enough in many cases as the reducer does a repeatable operation:

i=0; while (( i++ < 3000 )); do jstack 5809 >> tezchild_stacks.txt; sleep 0.1; done

Note you have to avoid using sudo -u yarn for calling jstack in a loop as it adds a significant overhead.

Next you can use Brendan Gregg’s tools (https://github.com/brendangregg/FlameGraph) to process the generated stack trace and build a flame graph for the TezChild thread:

$ ./stackcollapse-jstack.pl tezchild_stacks.txt | grep TezChild > tezchild_stacks_collapsed.txt
$ ./flamegraph.pl --colors green tezchild_stacks_collapsed.txt > tezchild_stacks.svg

Here is my sample result:

We took 3,000 TezChild stack trace samples and the reducer was 2,943 times (98,10%) in the GenericUDAFPercentileApprox function!

In my case, the query invoked the PERCENTILE_APPROX function multiple times, and it turned out to be a quite expensive operation:

PERCENTILE_APPROX(load_time, 0.25) as p25_time,
PERCENTILE_APPROX(load_time, 0.50) as p50_time,
PERCENTILE_APPROX(load_time, 0.75) as p75_time,
PERCENTILE_APPROX(load_time, 0.90) as p90_time,
PERCENTILE_APPROX(load_time, 0.95) as p95_time,
PERCENTILE_APPROX(load_time, 0.99) as p99_time

But the PERCENTILE_APPROX function allows you to specify an array of percentiles values that helps reduce the number of sorts:

PERCENTILE_APPROX(load_time, array(0.25, 0.50, 0.75, 0.90, 0.95, 0.99))

In my case this change reduced the query execution from 5 hours to about 1 hour.

You can see that even simple jstack profiler can help you easily find the performance problems even in very complex queries and DAGs. Additionally, for more advanced profiling you can use an Java agent-based profilers.