AWS,  EMR,  Hadoop,  YARN

Hadoop YARN – Collecting Utilization Metrics from Multiple Clusters

When you run many Hadoop clusters it is useful to automatically collect metrics from all clusters in a single place (Hive table i.e.).

This allows you to perform any advanced and custom analysis of your clusters workload and not be limited to the features provided by Hadoop Administration UI tools that often offer only per cluster view so it is hard to see the whole picture of your data platform.

The first step is to dynamically get the list of clusters and their IPs. Hadoop clusters are often reprovisioned, added and terminated, so you cannot use the static list and addresses. In case of Amazon EMR, you can use the following Linux shell command to get the list of active clusters:

aws emr list-clusters --active

From its output you can get the cluster IDs and names. As a cluster ID and IP can change over time, its name is usually permanent (like DEV or Adhoc-Analytics cluster) so it can be useful for various aggregation reports.

Then for each cluster you can get its MasterPublicDnsName and the private IP by running:

-- Get MasterPublicDnsName
aws emr describe-cluster --cluster-id=<ID>

-- Get private IP
host <MasterPublicDnsName>

After you have a list of Hadoop YARN clusters and their IPs you can query them to get the cluster metrics:

curl http://<cluster_ip>:8088/ws/v1/cluster/metrics

Sample output:

{ "clusterMetrics": {
        "appsSubmitted": 180739,
        "appsCompleted": 176616,
        "appsPending": 0,
        "appsRunning": 0,
        "appsFailed": 323,
        "appsKilled": 3800,
        "reservedMB": 0,
        "availableMB": 3363840,
        "allocatedMB": 0,
        "reservedVirtualCores": 0,
        "availableVirtualCores": 2419,
        "allocatedVirtualCores": 0,
        "containersAllocated": 0,
        "containersReserved": 0,
        "containersPending": 0,
        "totalMB": 3363840,
        "totalVirtualCores": 2419,
        "totalNodes": 230,
        "lostNodes": 84,
        "unhealthyNodes": 0,
        "decommissionedNodes": 0,
        "decommissioningNodes": 0,
        "rebootedNodes": 0,
        "activeNodes": 146
    }
}

I have an ETL process that extracts these metrics for every Hadoop cluster every minute and puts into a Hive table yarn_clusters_state:

CREATE EXTERNAL TABLE yarn_clusters_state (
  name     STRING,
  event_ts STRING,
  metrics  STRING,
  ip       STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/warehouse.db/yarn_clusters_state';

This table contains time series data for all Hadoop clusters, for example:

PROD-01   2019-05-15T00:01:04.323   {"allocatedVirtualCores":1380,"appsPending":0, ...
PROD-02   2019-05-15T00:01:04.332   {"allocatedVirtualCores":0,"appsPending":0, ...
PROD-03   2019-05-15T00:01:04.341   {"allocatedVirtualCores":2432,"appsPending":3, ...
PROD-04   2019-05-15T00:01:04.356   {"allocatedVirtualCores":3471,"appsPending":2, ...

Then you can get summary statistics about completed and running applications as well as maximum and average memory usage as follows:

select
  name, 
  event_dt,
  max(appsCompleted) - min(appsCompleted) as apps_completed, 
  max(appsRunning)                        as max_apps,  
  avg(appsRunning)                        as avg_apps,
  max((totalMB - availableMB)/totalMB)    as max_memory,
  avg((totalMB - availableMB)/totalMB)    as avg_memory
from 
(
  select
    name, 
    substring(event_ts, 1, 10)                               as event_dt,
    cast(get_json_object(metrics, "$.appsRunning") as int)   as appsRunning,
    cast(get_json_object(metrics, "$.appsCompleted") as int) as appsCompleted,
    cast(get_json_object(metrics, "$.availableMB") as int)   as availableMB,
    cast(get_json_object(metrics, "$.totalMB") as int)       as totalMB
  from yarn_clusters_state
  where substring(event_ts, 1, 10) >= '2019-05-01'
) cs
group by name, event_dt;

Sample output:

name event_dt apps_completed max_apps avg_apps max_memory avg_memory
PROD-01 2019-05-15 5458 28 7.1 0.76 0.14
PROD-02 2019-05-15 845 19 1.7 0.99 0.18
PROD-03 2019-05-15 4520 45 16.4 1.00 0.25
PROD-04 2019-05-15 782 72 37.5 0.99 0.19

You can collect similar reports for CPU and YARN containers utilization, get hourly or 10-minute summaries, and build any other custom reports.