AWS, EMR, Hadoop, YARN

Amazon EMR – Recovering Unhealthy Nodes with EMR Services Down

Usually Hadoop is able to automatically recover cluster nodes from the Unhealthy state by cleaning log and temporary directories. But sometimes nodes stay unhealthy for a long time, and manual intervention is necessary to bring them back.

In one Hadoop cluster I found a node that had been running in the Unhealthy state for many days.

The Unhealthy state means that the node is reachable and its YARN NodeManager is running, but the node cannot be used to schedule task execution (run YARN containers) for some reason. In my case the log message showed that there was not enough disk space on the node.
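
As a side note, the same information is available without the ResourceManager UI: the YARN CLI on the master node can list nodes in every state and print a node's health report with the exact reason (the node ID below is hypothetical):

$ yarn node -list -all
$ yarn node -status ip-10-0-0-12.ec2.internal:8041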

Connecting to the node, I saw that /mnt1 and /var (on the root volume) had enough space, while /emr was 100% full:

$ ssh -i "private_key_file" root@ip_address

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/xvda1       50G  6.6G   43G  14% /
/dev/xvdb1      5.0G  5.0G   20K 100% /emr
/dev/xvdc       153G  9.3G  144G   7% /mnt1
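
For context, YARN marks a node Unhealthy when one of its monitored local or log directories exceeds a disk utilization threshold, 90% by default, controlled by the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property. The check that flagged a full /emr here may well be EMR-specific, but the threshold is worth knowing. You can see whether the cluster overrides it (no output means the default applies):

$ grep -A 1 max-disk-utilization /etc/hadoop/conf/yarn-site.xml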

The /emr directory is used by the EMR services that manage the node:

$ du -sh /emr/*
16K     /emr/apppusher
143M    /emr/instance-controller
565M    /emr/instance-state
4.1G    /emr/logpusher
177M    /emr/service-nanny
56K     /emr/setup-devices

$ du -sh /emr/logpusher/*
8.0K    /emr/logpusher/db
0       /emr/logpusher/lib
4.1G    /emr/logpusher/log
4.0K    /emr/logpusher/run

For some reason logpusher had been unable to rotate and clean its own logs, so I removed the old files manually.
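
Something along these lines does the cleanup; the 7-day retention is my own arbitrary choice, and it is worth checking that logpusher is not actively writing to the files being deleted:

$ sudo find /emr/logpusher/log -type f -mtime +7 -delete
$ df -h /emr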

Then I noticed that the EMR services were not running on this node:

$ sudo /etc/init.d/instance-controller status
Not Running                                                [WARNING]

$ sudo /etc/init.d/service-nanny status
Not Running                                                [WARNING]

$ sudo /etc/init.d/logpusher status
Not Running                                                [WARNING]

So I had to start them manually, replacing status with start in the commands above:
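
$ sudo /etc/init.d/instance-controller start
$ sudo /etc/init.d/service-nanny start
$ sudo /etc/init.d/logpusher start

The last step was to restart the NodeManager: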

$ sudo stop hadoop-yarn-nodemanager
$ sudo start hadoop-yarn-nodemanager

And now the node is back.
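
To double-check, you can query the NodeManager job through the same init system as above, and re-run yarn node -list -all on the master to confirm the node has left the UNHEALTHY state:

$ sudo status hadoop-yarn-nodemanager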