Usually Hadoop is able to automatically recover cluster nodes from the Unhealthy state by cleaning log and temporary directories, but sometimes a node stays unhealthy for a long time and manual intervention is needed to bring it back.
In one Hadoop cluster I found a node that had been running in the Unhealthy state for many days:
The Unhealthy state means that the node is reachable and runs the YARN NodeManager, but it cannot be used to schedule task execution (run YARN containers) for various reasons. In my case the log message showed that there was not enough disk space on the node.
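A node is marked Unhealthy when local-dir disk utilization crosses the NodeManager's disk health-checker threshold, which is configurable in yarn-site.xml. The values below are the defaults, shown for illustration:

```xml
<!-- yarn-site.xml: NodeManager disk health-checker settings (default values) -->
<property>
  <!-- a disk is considered full (unhealthy) above this utilization -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
<property>
  <!-- minimum fraction of local dirs that must be healthy for the node to stay usable -->
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
</property>
```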
Connecting to the node, I see that /var has enough space while /emr is full:
$ ssh -i "private_key_file" root@ip_address

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/xvda1       50G  6.6G   43G  14% /
/dev/xvdb1      5.0G  5.0G   20K 100% /emr
/dev/xvdc       153G  9.3G  144G   7% /mnt1
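Reading df output by eye works for one node; across many nodes a small helper that reports only the filesystems above a threshold is handier. This is a sketch of my own, using only POSIX df and awk; the function name and the 90% cutoff (matching YARN's default) are my choices:

```shell
#!/bin/sh
# List mount points whose utilization is at or above a threshold (default 90%).
list_full() {
  threshold=${1:-90}
  df -P | awk -v t="$threshold" 'NR > 1 {
    gsub(/%/, "", $5)                  # strip the % sign from the Use% column
    if ($5 + 0 >= t) print $6, $5 "%"  # mount point and its utilization
  }'
}

list_full 90
```

On the node above this would print only `/emr 100%`.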
The /emr directory is used by the EMR services that manage the node:
$ du -sh /emr/*
16K     /emr/apppusher
143M    /emr/instance-controller
565M    /emr/instance-state
4.1G    /emr/logpusher
177M    /emr/service-nanny
56K     /emr/setup-devices

$ du -sh /emr/logpusher/*
8.0K    /emr/logpusher/db
0       /emr/logpusher/lib
4.1G    /emr/logpusher/log
4.0K    /emr/logpusher/run
For some reason logpusher was unable to rotate and clean its logs, so I removed them manually.
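The manual cleanup can be sketched as a small helper. This is hypothetical: the 7-day retention period is my own choice, on the real node it must run with sudo against /emr/logpusher/log, and you should verify the service is not holding the files open before deleting them:

```shell
#!/bin/sh
# Delete files older than a given number of days, printing what was removed.
clean_old_logs() {
  dir=$1
  days=$2
  find "$dir" -type f -mtime +"$days" -print -delete
}

# On the unhealthy node this would be run as:
#   sudo sh -c '. ./clean.sh; clean_old_logs /emr/logpusher/log 7'
```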
Then I noticed that the EMR services were not running on this node:
$ sudo /etc/init.d/instance-controller status
Not Running [WARNING]

$ sudo /etc/init.d/service-nanny status
Not Running [WARNING]

$ sudo /etc/init.d/logpusher status
Not Running [WARNING]
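The three per-service checks above can be folded into one loop that also starts whatever is down. This is a sketch, to be run as root on the node; it assumes the init scripts exit nonzero when the service is not running, which I have not verified on EMR (if they always exit zero, parse the "Not Running" output instead):

```shell
#!/bin/sh
# Start each EMR management service that reports itself as not running.
start_if_stopped() {
  for svc in instance-controller service-nanny logpusher; do
    if ! "/etc/init.d/$svc" status >/dev/null 2>&1; then
      echo "starting $svc"
      "/etc/init.d/$svc" start || echo "failed to start $svc" >&2
    fi
  done
  return 0
}
```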
So I had to start them manually (replacing status with start in the commands above). The last step was to restart the YARN NodeManager:
$ sudo stop hadoop-yarn-nodemanager
$ sudo yarn nodemanager &
And now the node is back.