On one of the clusters I noticed an increased rate of shuffle errors, and restarting the job did not help: it kept failing with the same error.
The error was as follows:
Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher
  at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
Caused by: java.io.IOException: Shuffle failed with too many fetch failures and insufficient progress!
  failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
Looking at the container log I could see more details:
2020-02-26 13:09:38,159 [WARN] [Fetcher #0] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to 10.85.65.44:13562 with 1 inputs pending
java.io.IOException: Server returned HTTP response code: 401 for URL: http://10.85.65.44:13562/mapOutput?...
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
  at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:248)
  at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:348)
  at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
  at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
So HTTP response code 401 was returned by a remote Node Manager when a map output was requested. How is it possible that a Node Manager returns an HTTP 401 (Unauthorized) error?
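Some background on why a 401 can appear here (my summary, not taken from the logs): each shuffle fetch is authenticated with an HMAC hash derived from a per-job secret (the job token) that both the requesting reducer and the serving Node Manager hold; if the server side no longer has that secret, verification fails and it answers 401. A toy sketch of such a shared-secret check using openssl (the secret and URL values below are made up, and this is not the actual Hadoop wire format):

```shell
# Toy illustration of shuffle authentication: both sides share a per-job
# secret; the client sends an HMAC of the request URL, the server recomputes
# it. If the server lost the secret, the hashes differ -> 401 Unauthorized.
SECRET="job-token-for-job_1581241769187_51527"   # hypothetical secret
URL="http://10.85.65.44:13562/mapOutput?job=job_1581241769187_51527"

client_hash=$(printf '%s' "$URL" | openssl dgst -sha1 -hmac "$SECRET" -r | cut -d' ' -f1)
server_hash=$(printf '%s' "$URL" | openssl dgst -sha1 -hmac "$SECRET" -r | cut -d' ' -f1)

if [ "$client_hash" = "$server_hash" ]; then
  echo "200 OK"            # secret known on both sides
else
  echo "401 Unauthorized"  # what happens after the server loses the secret
fi
```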
One possible reason is a Node Manager restart during the job execution. To check this, I went to host 10.85.65.44 and looked at its Node Manager log:
$ ssh -i key firstname.lastname@example.org
$ cat /var/log/hadoop-yarn/yarn-yarn-nodemanager-ip-10-85-65-44.log
In the log I could see that the Node Manager started during the job execution, which means it had been terminated shortly before that:
2020-02-26 13:09:23,233 INFO org.apache.hadoop.yarn.webapp.WebApps (main): Web app node started at 8042
2020-02-26 13:09:23,239 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at 10.85.65.17:8025
2020-02-26 13:09:23,259 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl (main): Sending out 0 NM container statuses:
2020-02-26 13:09:23,264 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl (main): Registering with RM using containers :
2020-02-26 13:09:23,306 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl (main): Registered with ResourceManager as 10.85.65.17:8041 with total resource of <memory:53248, vCores:36>
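A quick way to confirm a mid-job restart is to count the service startup banners in the Node Manager log: more than one banner in the relevant time window means the daemon came up again. A sketch of the check (a real log would live at a path like /var/log/hadoop-yarn/yarn-yarn-nodemanager-ip-10-85-65-44.log; here a tiny made-up sample stands in for it):

```shell
# Hadoop daemons print a STARTUP_MSG banner each time they start, so the
# number of banners equals the number of (re)starts covered by the log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2020-02-26 01:02:03,000 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager (main): STARTUP_MSG: Starting NodeManager
2020-02-26 13:09:23,233 INFO org.apache.hadoop.yarn.webapp.WebApps (main): Web app node started at 8042
2020-02-26 13:09:23,100 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager (main): STARTUP_MSG: Starting NodeManager
EOF

grep -c "STARTUP_MSG" "$LOG"   # prints 2: the NM started twice, i.e. it restarted
```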
And later in the same log:
2020-02-26 13:09:25,953 WARN org.apache.hadoop.mapred.ShuffleHandler (New I/O worker #2): Shuffle failure org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job job_1581241769187_51527 !!
job_1581241769187_51527 is exactly the job that failed with the HTTP 401 error when asking this node for a map output!
So the cause of the error is a Node Manager restart (probably due to a Java heap space or another error): the restarted Node Manager lost the job's security token, so it can no longer serve the shuffle data to the other Node Managers.
If it is not a transient error, it makes sense to investigate why the Node Manager terminates and is then restarted (in my case, by the Amazon EMR instance controller) and to resolve the underlying cause.
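One mitigation worth evaluating (my suggestion, not part of the original troubleshooting): YARN's work-preserving Node Manager restart lets the auxiliary shuffle service persist its state, including job tokens, and recover it after an NM restart, so a restarted node can still serve shuffle data for running jobs. A sketch of the relevant yarn-site.xml settings (the recovery directory path and port below are illustrative):

```xml
<!-- yarn-site.xml: enable work-preserving Node Manager restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- local directory where the NM persists its recovery state -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- the NM must listen on a fixed port (not the default ephemeral one)
       so that recovery works across restarts -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:8041</value>
</property>
```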