You can use the YARN REST API to collect various Hadoop cluster metrics such as available and allocated memory, CPU, containers, and so on.
If you set up a process to extract data from this API, for example, once per minute, you can easily collect and analyze historical and current cluster utilization quite accurately. For more details, see the Collecting Utilization Metrics from Multiple Clusters article.
But even if you query the YARN REST API every second, it can still only provide a snapshot of the used YARN resources. It does not show which application allocates or releases containers, the memory and CPU capacity of those containers, in which order these events occur, what their exact timestamps are, and so on.
For this reason I prefer a different approach, based on using the YARN Resource Manager logs to calculate the exact per-second utilization metrics of a Hadoop cluster.
YARN Resource Manager Logs
You can query the following URL at the Hadoop master node to get links to the Resource Manager logs:
http://<cluster_ip>:8088/logs/
There is a log for the current hour, as well as compressed logs for the previous hours that are kept for a couple of days:
```
yarn-yarn-resourcemanager-<cluster_ip>.log
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-00.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-01.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-02.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-03.gz
...
```
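To automate the collection, you can script the download of these files. Here is a minimal Python sketch, assuming the /logs/ servlet returns a plain HTML directory listing (as it does by default) and with `<cluster_ip>` as a placeholder you need to fill in:

```python
import gzip
import re
import urllib.request

LOGS_URL = "http://<cluster_ip>:8088/logs/"  # replace <cluster_ip> with your master node

def list_rm_logs():
    """Scrape Resource Manager log file names from the /logs/ directory listing."""
    html = urllib.request.urlopen(LOGS_URL).read().decode("utf-8")
    return sorted(set(re.findall(
        r'yarn-yarn-resourcemanager-[^"<>\s]+?\.log(?:\.[\d-]+\.gz)?', html)))

def read_log(name):
    """Download one log file, decompressing the hourly .gz archives on the fly."""
    data = urllib.request.urlopen(LOGS_URL + name).read()
    if name.endswith(".gz"):
        data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")
```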
A Resource Manager log contains one text line per event:
```
2019-09-08 17:00:00,798 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (ResourceManager Event Processor): Assigned container container_1546747261569_328693_01_000002 of capacity <memory:4096, vCores:1> on host <host_ip>:8041, which has 1 containers, <memory:4096, vCores:1> used and <memory:12288, vCores:7> available after allocation
...
2019-09-08 17:00:02,798 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (IPC Server handler 8 on 8030): Released container container_1546747261569_328693_01_000003 of capacity <memory:4096, vCores:1> on host <host_ip>:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:30720, vCores:16> available, release resources=true
```
It makes sense to parse the events you are interested in into a Hive table for further analysis:
```sql
CREATE EXTERNAL TABLE yarn_rm_log_events
(
  ts string,
  log_level string,
  source_class string,
  type string,
  payload string
);
```
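Before loading, the raw log lines need to be split into these columns. Below is one possible parser, a minimal Python sketch handling just the two SchedulerNode event types used in this article; the remaining payload fields, such as containers_used, can be extracted the same way:

```python
import json
import re

# Matches "Assigned container" and "Released container" events logged by SchedulerNode
EVENT_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'(?P<level>\w+) (?P<cls>[\w.$]+) \([^)]*\): '
    r'(?P<type>Assigned container|Released container) '
    r'(?P<container>container_(?P<app>\d+_\d+)_\d+_\d+) '
    r'of capacity <memory:(?P<mem>\d+), vCores:(?P<cores>\d+)> '
    r'on host (?P<host>\S+),'
)

def parse_line(line):
    """Turn one Resource Manager log line into a row for yarn_rm_log_events."""
    m = EVENT_RE.search(line)
    if not m:
        return None
    payload = {
        "application_id": "application_" + m.group("app"),
        "container_id": m.group("container"),
        "memory": int(m.group("mem")),
        "cores": int(m.group("cores")),
        "host": m.group("host"),
    }
    return (m.group("ts"), m.group("level"),
            m.group("cls").rsplit(".", 1)[-1],  # short class name, e.g. SchedulerNode
            m.group("type"), json.dumps(payload))
```

Each returned tuple maps one-to-one to the table columns, so the output can be written to a delimited file and loaded into the table, e.g. with Hive's LOAD DATA statement.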
In my case it contains the following sample data:
| ts | log_level | source_class | type | payload |
|---|---|---|---|---|
| 2019-09-08 17:00:00,798 | INFO | SchedulerNode | Assigned container | {"application_id": … } |
| 2019-09-08 17:00:02,798 | INFO | SchedulerNode | Released container | {"application_id": … } |
The most interesting part is the payload column:
{ "application_id" : "application_1546747261569_328693", "container_id" : "container_1546747261569_328693_01_000002", "memory" : 4096, "cores" : 1, "host" : "host_ip:8041", "containers_used" : 1, "memory_used" : 4096, "cores_used" : 1, "memory_avail" : 12288, "cores_avail" : 7 }
Now it is quite straightforward to get the start and end timestamps for each container:
```sql
select
  get_json_object(payload, '$.application_id') app_id,
  get_json_object(payload, '$.container_id') container_id,
  min(case when type = 'Assigned container' then ts else null end) assigned_ts,
  max(case when type = 'Released container' then ts else null end) released_ts,
  max(get_json_object(payload, '$.memory')) memory,
  max(get_json_object(payload, '$.cores')) cores,
  max(get_json_object(payload, '$.host')) host
from yarn_rm_log_events
where source_class = 'SchedulerNode'
group by
  get_json_object(payload, '$.application_id'),
  get_json_object(payload, '$.container_id');
```
Then you can write a script that iterates over every second from 00:00:00 to 23:59:59 and finds out which containers were running in that second and how much memory and how many CPU cores they consumed.
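Here is what such a script might look like; a straightforward (not particularly optimized) Python sketch that takes the rows returned by the query above and expands each container's lifetime into per-second counters. It assumes containers without a release event run until midnight:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
DAY_SECONDS = 86400

def per_second_usage(containers, day="2019-09-08"):
    """containers: rows from the Hive query above, as
    (app_id, container_id, assigned_ts, released_ts, memory, cores, host) tuples."""
    base = datetime.strptime(day + " 00:00:00", FMT)
    # per second: [num containers, memory used, cores used, distinct app ids]
    stats = [[0, 0, 0, set()] for _ in range(DAY_SECONDS)]
    for app_id, _container_id, assigned_ts, released_ts, memory, cores, _host in containers:
        # drop the milliseconds (",798") and clip the interval to the day boundaries;
        # a container with no release event is assumed to run until midnight
        first = 0 if assigned_ts is None else \
            max(0, int((datetime.strptime(assigned_ts[:19], FMT) - base).total_seconds()))
        last = DAY_SECONDS - 1 if released_ts is None else \
            min(DAY_SECONDS - 1, int((datetime.strptime(released_ts[:19], FMT) - base).total_seconds()))
        for sec in range(first, last + 1):
            s = stats[sec]
            s[0] += 1
            s[1] += int(memory)
            s[2] += int(cores)
            s[3].add(app_id)
    return [((base + timedelta(seconds=sec)).strftime("%H:%M:%S"), c, m, v, len(apps))
            for sec, (c, m, v, apps) in enumerate(stats)]
```

For large clusters it is faster to only increment the counters at the assigned second and decrement them at the released second, and then take a running sum over the day.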
Here is my sample data:
| second | num of containers | memory used (MB) | cores used | num of distinct applications |
|---|---|---|---|---|
| 00:00:00 | 1 | 1024 | 1 | 1 |
| 00:00:01 | 2 | 2048 | 2 | 2 |
| 00:00:02 | 4 | 10240 | 4 | 2 |
The result has 86,400 rows, i.e. the number of seconds in 24 hours. So now you know the exact per-second utilization of your cluster.
Utilization of Fixed-Sized Clusters – No Auto-scaling
Using per-second utilization metrics you can calculate daily and hourly cluster utilization.
Let’s see how we can use this data to calculate the daily utilization for fixed-sized clusters, i.e. clusters with a constant number of nodes.
```
                            sum(memory used)
Memory utilization = ------------------------------ * 100%
                     total cluster memory * 86,400
```
The same applies to CPU cores:
```
                        sum(cores used)
CPU utilization = ---------------------------- * 100%
                  total cluster cores * 86,400
```
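In code, both formulas reduce to a couple of sums over the per-second rows. A small sketch, assuming per_second is the output of the per_second_usage function above and the fixed cluster capacity is passed in:

```python
def daily_utilization(per_second, total_memory_mb, total_cores):
    """Daily memory and CPU utilization (%) of a fixed-sized cluster."""
    memory_used = sum(row[2] for row in per_second)  # sum(memory used)
    cores_used = sum(row[3] for row in per_second)   # sum(cores used)
    return (100.0 * memory_used / (total_memory_mb * 86400),
            100.0 * cores_used / (total_cores * 86400))
```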
Besides simple averages you can also compute percentiles and apply various time-series analysis models and techniques.
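For example, with Python's standard library you can get the 95th percentile of per-second memory usage in a couple of lines (again assuming the per_second rows from above):

```python
import statistics

memory_per_second = [row[2] for row in per_second]
p95 = statistics.quantiles(memory_per_second, n=20)[18]  # 95th percentile (Python 3.8+)
```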
The YARN Resource Manager logs can be very helpful for analyzing cluster utilization as well as troubleshooting various cluster and application issues.