When you run a job in Hadoop, you may notice the following error:

Application with id 'application_1545962730597_2614' doesn't exist in RM

Later, looking at the YARN Resource Manager UI at http://<RM_IP_Address>:8088/cluster/apps, you can see suspiciously low Application ID numbers.
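The same check also works from the command line through the Resource Manager REST API (a quick sketch, assuming curl and Python are available on the node, as they usually are on EMR):

$ curl -s http://<RM_IP_Address>:8088/ws/v1/cluster/apps | python -m json.tool | grep '"id"'

The application counter embedded in each ID restarts from 1 together with the Resource Manager, which is why unexpectedly low numbers are a red flag.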
Additionally, you can look at http://<RM_IP_Address>:8088/cluster/cluster to review the Resource Manager start timestamp.
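The same timestamp is available from the cluster info REST endpoint; the startedOn field in the response is the Resource Manager start time in epoch milliseconds:

$ curl -s http://<RM_IP_Address>:8088/ws/v1/cluster/info | python -m json.tool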
This means that your YARN Resource Manager was silently restarted in Amazon EMR, and some long-running jobs may have simply been killed, affecting your SLAs.
In my case the YARN Resource Manager process was killed due to an out-of-memory error. You can go to the /var/log/hadoop-yarn directory on your EMR master node and check the Resource Manager STDOUT output file yarn-yarn-resourcemanager-<ip-address>.out.1.gz (since the Resource Manager was already restarted, you need to look at the previous, rotated version):
Dec 28, 2018 2:05:32 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver as a provider class
...
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p
kill -9 %p"
# Executing /bin/sh -c "kill -9 22618
kill -9 22618"...
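To locate the failed instance quickly instead of paging through each file, you can search all rotated STDOUT logs for the OOM marker (a simple sketch; adjust the file pattern to your node):

$ cd /var/log/hadoop-yarn
$ zgrep -l "java.lang.OutOfMemoryError" yarn-yarn-resourcemanager-*.out*.gz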
You can run the following command on the YARN master node to check the Resource Manager memory settings (the -Xmx option):
$ ps aux | grep resourcemanager

yarn 35344 14.2 7.6 6835112 4723396 ? Sl Jan01 677:12 /usr/lib/jvm/java-openjdk/bin/java -Dproc_resourcemanager
  -Xmx3317m -XX:OnOutOfMemoryError=kill -9 %p -XX:OnOutOfMemoryError=kill -9 %p
  -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn ...
  org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
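If you prefer to ask the running JVM directly rather than parse its command line, jinfo reports the effective heap limit (assuming the JDK tools are installed, as they normally are on an EMR master; 3478126592 bytes is exactly 3317 MB):

$ sudo jinfo -flag MaxHeapSize 35344
-XX:MaxHeapSize=3478126592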
As you can see, the Resource Manager was started with -Xmx3317m, but you can change this setting for the next restart using the YARN_RESOURCEMANAGER_HEAPSIZE environment variable in /etc/hadoop/conf/yarn-env.sh.
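For example (a sketch; 6144 is an illustrative value in megabytes, size it to the memory actually available on your master node):

# /etc/hadoop/conf/yarn-env.sh
# Give the Resource Manager a 6 GB heap after the next restart
export YARN_RESOURCEMANAGER_HEAPSIZE=6144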
Additionally, you can use the YARN Resource Manager restart feature, available since Hadoop 2.6, which allows you to preserve the jobs that were running when the Resource Manager restarted. You can set the following option in the /etc/hadoop/conf/yarn-site.xml file:
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
With this option the Resource Manager does not need to kill the applications during a restart; they can simply re-sync with the restarted Resource Manager and resume their work.
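Note that recovery relies on a Resource Manager state store. If your cluster does not configure one explicitly, a minimal sketch might look like this (FileSystemRMStateStore is the Hadoop default; a ZooKeeper-backed store is recommended for HA setups):

<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>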
Current YARN Resource Manager Heap Usage
Knowing the Resource Manager process ID, you can check the current Java heap usage:
$ sudo jstat -gc 35344

 S0C     S1C     S0U   S1U   EC      EU      OC       OU       MC     MU    ...
 102400  104448  0.0   0.0   795136  687850  2265088  2264931  50008  48928 ...
The total heap combines Survivor spaces 0 and 1 (S0C, S1C), Eden (EC), the Old generation (OC) and Metaspace (MC), all reported by jstat in KB:

S0C + S1C + EC + OC + MC = 102400 + 104448 + 795136 + 2265088 + 50008 = 3,317,080 KB ≈ 3317 MB

The used space is:

S0U + S1U + EU + OU + MU = 0 + 0 + 687850 + 2264931 + 48928 = 3,001,709 KB ≈ 3002 MB

That is 90.5% of the total, so the heap was already close to exhaustion.
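To track this percentage without manual arithmetic, you can let awk sum the same columns (a sketch assuming the Java 8 jstat -gc column layout shown above):

$ sudo jstat -gc 35344 | awk 'NR == 2 {
    total = $1 + $2 + $5 + $7 + $9    # S0C + S1C + EC + OC + MC, KB
    used  = $3 + $4 + $6 + $8 + $10   # S0U + S1U + EU + OU + MU, KB
    printf "used %d of %d KB (%.1f%%)\n", used, total, used * 100 / total
  }'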