
Hadoop YARN – Monitoring Resource Consumption by Running Applications in Multi-Cluster Environments

In the cloud it is typical to run multiple compute clusters, and browsing the Web UI of every cluster to check the current resource consumption by applications is not always easy or convenient, especially when the YARN clusters are managed by different Hadoop distributions (Amazon EMR, Cloudera, Qubole, etc.).

Let’s see how you can automate this process and find out how many applications are running and which resources they are consuming (containers, memory and CPU).

Resource Manager REST API

For every cluster you can query http://rm_ip_address:8088/ws/v1/cluster/apps?state=RUNNING (8088 is the default Resource Manager web UI port) to get information about the running applications and the resources they use.

Sample output:

{
  "apps": {
    "app": [
      {
        "id": "application_1592741179597_5139",
        ...
        "allocatedMB": 11072,
        "allocatedVCores": 3,
        "runningContainers": 3,
        ...
      },
      ...
    ]
  }
}
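
Before scripting anything, you can sanity-check the endpoint from Python. This is a minimal sketch; the IP address is a placeholder, and it assumes the Resource Manager web UI is reachable on the default port 8088:

import requests

# Placeholder Resource Manager address - replace with your own
rm = "10.33.22.48"

# Query only the applications currently in the RUNNING state
response = requests.get(
    "http://%s:8088/ws/v1/cluster/apps" % rm,
    params={"state": "RUNNING"},
    timeout=60,
)
response.raise_for_status()

# Note: "apps" is null when no applications are running
print(response.json())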

You can use the following Python script to quickly get information about the cluster workloads:

import sys
import requests

# Example:
#   python yarn_running_apps.py "10.33.22.48,10.33.65.19,10.33.100.133"

# Get information about running YARN applications for every cluster
def yarnRunningApps(clusters):
    for c in clusters:
        url = "http://" + c + ":8088/ws/v1/cluster/apps?state=RUNNING"
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        # "apps" is null (None after parsing) when no applications are running
        apps = response.json()["apps"]
        if apps:
            apps = apps["app"]
            containers = 0
            memory = 0
            vcores = 0
            print("\nCluster: " + c + " - " + str(len(apps)) + " application(s)")
            for a in apps:
                print("\n  %s (%s)" % (a["id"], a["name"]))
                print("\t%d container(s)" % a["runningContainers"])
                print("\t%.1f GB" % (a["allocatedMB"] / 1024))
                print("\t%d vCores" % a["allocatedVCores"])
                print("\t%.1f elapsed minutes" % (a["elapsedTime"] / 60000))
                containers += a["runningContainers"]
                memory += a["allocatedMB"]
                vcores += a["allocatedVCores"]
            print("\nTotal: %d container(s)\t%.1f GB\t%d vCores" % (containers, memory / 1024, vcores))
        else:
            print("\nCluster: " + c + " - No applications running")

if __name__ == '__main__':
    # sys.argv[0] is the script name, so a cluster list requires a second argument
    if len(sys.argv) < 2:
        print("ERROR: Specify a comma-separated list of clusters")
        sys.exit(1)
    clusters = sys.argv[1]
    yarnRunningApps(clusters.split(','))

The script accepts a comma-separated list of YARN cluster IP addresses, and here is its sample output:

Cluster: 10.33.22.48 - 2 application(s)

  application_1592704150573_1268 (HIVE-4d1046fd-2a1d-4b94-9f25-c47c4837a48c)
        416 container(s)
        1248.0 GB
        416 vCores
        0.3 elapsed minutes

  application_1592704150573_1251 (HIVE-1d3f608a-af31-4bcd-97f4-ee11b092faf4)
        3 container(s)
        10.1 GB
        3 vCores
        105.0 elapsed minutes

Total: 419 container(s)    1258.1 GB     419 vCores
...

The script can be easily extended to dynamically discover active clusters, show and aggregate more metrics, set up various alerts and so on.
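
For example, per-cluster totals are also available directly from the Resource Manager's /ws/v1/cluster/metrics endpoint, so you do not have to sum them per application yourself. Here is a minimal sketch; the field names follow the YARN REST API, while the cluster list and the alert threshold are illustrative assumptions:

import requests

# Hypothetical cluster list and alert threshold - adjust for your environment
CLUSTERS = ["10.33.22.48", "10.33.65.19"]
MAX_APPS_RUNNING = 100

for c in CLUSTERS:
    # Cluster-wide metrics aggregated by the Resource Manager
    url = "http://%s:8088/ws/v1/cluster/metrics" % c
    m = requests.get(url, timeout=60).json()["clusterMetrics"]
    print("Cluster: %s - %d app(s), %.1f GB allocated, %d vCores" %
          (c, m["appsRunning"], m["allocatedMB"] / 1024, m["allocatedVirtualCores"]))
    # A trivial alert: flag clusters running too many applications
    if m["appsRunning"] > MAX_APPS_RUNNING:
        print("  WARNING: appsRunning exceeds %d" % MAX_APPS_RUNNING)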