When you run many Hadoop clusters, it is useful to automatically collect metrics from all of them in a single place (e.g., a Hive table).
This allows you to perform advanced, custom analysis of your clusters' workloads instead of being limited to the features of Hadoop administration UI tools, which typically offer only a per-cluster view and make it hard to see the whole picture of your data platform.
The first step is to dynamically get the list of clusters and their IPs. Hadoop clusters are often reprovisioned, added and terminated, so you cannot rely on a static list of names and addresses. In the case of Amazon EMR, you can use the following Linux shell command to get the list of active clusters:
aws emr list-clusters --active
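The command prints a JSON document describing each active cluster. If you only need the IDs and names, you can let the CLI do the filtering itself; a minimal sketch using the CLI's built-in --query option:

# Print "Id<TAB>Name" for each active cluster
aws emr list-clusters --active --query 'Clusters[].[Id,Name]' --output text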
From its output you can get the cluster IDs and names. While a cluster ID and IP can change over time, the cluster name is usually permanent (like a DEV or Adhoc-Analytics cluster), so it is useful for various aggregation reports.
Then for each cluster you can get its MasterPublicDnsName and the private IP by running:
# Get MasterPublicDnsName
aws emr describe-cluster --cluster-id=<ID>

# Get private IP
host <MasterPublicDnsName>
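Putting the two steps together, here is a minimal sketch that resolves an IP for every active cluster (dig +short is one way to resolve MasterPublicDnsName; whether it returns the private IP depends on where you run it, and the loop structure is just illustrative):

aws emr list-clusters --active --query 'Clusters[].Id' --output text |
tr '\t' '\n' |
while read -r id; do
  dns=$(aws emr describe-cluster --cluster-id "$id" \
          --query 'Cluster.MasterPublicDnsName' --output text)
  echo "$id $(dig +short "$dns" | head -1)"
done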
After you have a list of Hadoop YARN clusters and their IPs, you can query the YARN ResourceManager REST API on each cluster to get its metrics:
curl http://<cluster_ip>:8088/ws/v1/cluster/metrics
Sample output:
{ "clusterMetrics": { "appsSubmitted": 180739, "appsCompleted": 176616, "appsPending": 0, "appsRunning": 0, "appsFailed": 323, "appsKilled": 3800, "reservedMB": 0, "availableMB": 3363840, "allocatedMB": 0, "reservedVirtualCores": 0, "availableVirtualCores": 2419, "allocatedVirtualCores": 0, "containersAllocated": 0, "containersReserved": 0, "containersPending": 0, "totalMB": 3363840, "totalVirtualCores": 2419, "totalNodes": 230, "lostNodes": 84, "unhealthyNodes": 0, "decommissionedNodes": 0, "decommissioningNodes": 0, "rebootedNodes": 0, "activeNodes": 146 } }
I have an ETL process that extracts these metrics from every Hadoop cluster every minute and puts them into a Hive table yarn_clusters_state:
CREATE EXTERNAL TABLE yarn_clusters_state
(
  name      STRING,
  event_ts  STRING,
  metrics   STRING,
  ip        STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/warehouse.db/yarn_clusters_state';
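The collector itself can be as simple as a script run from cron every minute. Below is a minimal sketch (the file names and the clusters.txt inventory are hypothetical); note that a Hive TEXTFILE table uses ^A (\001) as the default field delimiter, so the script writes rows accordingly:

#!/bin/bash
# clusters.txt holds "name<TAB>ip" pairs produced by the discovery steps above
ts=$(date +%Y-%m-%dT%H:%M:%S.%3N)   # GNU date; %3N gives milliseconds
out=/tmp/yarn_clusters_state_$$.txt

while IFS=$'\t' read -r name ip; do
  metrics=$(curl -s "http://$ip:8088/ws/v1/cluster/metrics")
  # name, event_ts, metrics, ip separated by the default Hive delimiter \001
  printf '%s\001%s\001%s\001%s\n' "$name" "$ts" "$metrics" "$ip" >> "$out"
done < clusters.txt

# New rows become visible as soon as the file lands in the table's S3 location
aws s3 cp "$out" "s3://cloudsqale/hive/warehouse.db/yarn_clusters_state/$(date +%s)_$$.txt"
rm -f "$out"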
This table contains time series data for all Hadoop clusters, for example:
PROD-01   2019-05-15T00:01:04.323   {"allocatedVirtualCores":1380,"appsPending":0, ...
PROD-02   2019-05-15T00:01:04.332   {"allocatedVirtualCores":0,"appsPending":0, ...
PROD-03   2019-05-15T00:01:04.341   {"allocatedVirtualCores":2432,"appsPending":3, ...
PROD-04   2019-05-15T00:01:04.356   {"allocatedVirtualCores":3471,"appsPending":2, ...
Then you can get summary statistics about completed and running applications as well as maximum and average memory usage as follows:
select
  name,
  event_dt,
  max(appsCompleted) - min(appsCompleted) as apps_completed,
  max(appsRunning) as max_apps,
  avg(appsRunning) as avg_apps,
  max((totalMB - availableMB)/totalMB) as max_memory,
  avg((totalMB - availableMB)/totalMB) as avg_memory
from
(
  select
    name,
    substring(event_ts, 1, 10) as event_dt,
    cast(get_json_object(metrics, "$.appsRunning") as int) as appsRunning,
    cast(get_json_object(metrics, "$.appsCompleted") as int) as appsCompleted,
    cast(get_json_object(metrics, "$.availableMB") as int) as availableMB,
    cast(get_json_object(metrics, "$.totalMB") as int) as totalMB
  from yarn_clusters_state
  where substring(event_ts, 1, 10) >= '2019-05-01'
) cs
group by name, event_dt;
Sample output:
name | event_dt | apps_completed | max_apps | avg_apps | max_memory | avg_memory |
---|---|---|---|---|---|---|
PROD-01 | 2019-05-15 | 5458 | 28 | 7.1 | 0.76 | 0.14 |
PROD-02 | 2019-05-15 | 845 | 19 | 1.7 | 0.99 | 0.18 |
PROD-03 | 2019-05-15 | 4520 | 45 | 16.4 | 1.00 | 0.25 |
PROD-04 | 2019-05-15 | 782 | 72 | 37.5 | 0.99 | 0.19 |
You can collect similar reports for CPU and YARN container utilization, get hourly or 10-minute summaries, and build any other custom reports.
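For example, because event_ts is stored as an ISO 8601 string, truncating it to the first 13 characters gives an hourly bucket (and the first 15, a 10-minute bucket). A sketch of an hourly summary of running applications:

select
  name,
  substring(event_ts, 1, 13) as event_hour,   -- 'YYYY-MM-DDTHH'
  max(cast(get_json_object(metrics, "$.appsRunning") as int)) as max_apps,
  avg(cast(get_json_object(metrics, "$.appsRunning") as int)) as avg_apps
from yarn_clusters_state
where substring(event_ts, 1, 10) = '2019-05-15'
group by name, substring(event_ts, 1, 13);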