
Hadoop YARN – Calculating Per Second Utilization of Cluster using Resource Manager Logs

You can use the YARN REST API to collect various Hadoop cluster metrics such as available and allocated memory, CPU, containers, and so on.
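For example, the Resource Manager exposes cluster-wide metrics at the following endpoint (8088 is the default web UI port; yours may differ):

http://<cluster_ip>:8088/ws/v1/cluster/metrics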

If you set up a process to extract data from this API, once per minute for example, you can easily collect and analyze historical and current cluster utilization quite accurately. For more details, see the Collecting Utilization Metrics from Multiple Clusters article.

But even if you query the YARN REST API every second, it can still only provide a snapshot of the used YARN resources. It does not show which application allocates or releases containers, their memory and CPU capacity, in which order these events occur, their exact timestamps, and so on.

For this reason, I prefer a different approach based on the YARN Resource Manager logs to calculate exact per-second utilization metrics for a Hadoop cluster.

YARN Resource Manager Logs

You can query the following URL at the Hadoop master node to get links to the Resource Manager logs:

http://<cluster_ip>:8088/logs/

There is a log for the current hour as well as compressed logs for the previous hours kept for a couple of days:

yarn-yarn-resourcemanager-<cluster_ip>.log
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-00.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-01.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-02.gz
yarn-yarn-resourcemanager-<cluster_ip>.log.2019-09-08-03.gz
...

A Resource Manager log contains a line of text for each event:

2019-09-08 17:00:00,798 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (ResourceManager Event Processor): Assigned container container_1546747261569_328693_01_000002 of capacity <memory:4096, vCores:1> on host <host_ip>:8041, which has 1 containers, <memory:4096, vCores:1> used and <memory:12288, vCores:7> available after allocation
...
2019-09-08 17:00:02,798 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (IPC Server handler 8 on 8030): Released container container_1546747261569_328693_01_000003 of capacity <memory:4096, vCores:1> on host <host_ip>:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:30720, vCores:16> available, release resources=true

It makes sense to parse the events you are interested in into a Hive table for further analysis:

CREATE EXTERNAL TABLE yarn_rm_log_events (
  ts           string, 
  log_level    string, 
  source_class string, 
  type         string, 
  payload      string
);
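How you populate this table is up to you; below is a minimal HiveQL sketch, assuming the raw log lines are first mounted as a hypothetical one-column external table yarn_rm_log_raw(line string). The regular expressions match the sample lines above and may need adjustment for your Hadoop version:

INSERT INTO TABLE yarn_rm_log_events
SELECT
  -- event timestamp, e.g. 2019-09-08 17:00:00,798
  regexp_extract(line, '^([0-9-]+ [0-9:,]+)', 1) AS ts,
  regexp_extract(line, '^[0-9-]+ [0-9:,]+ (\\w+)', 1) AS log_level,
  -- short class name, e.g. SchedulerNode
  regexp_extract(line, '\\.([A-Za-z]+) \\(', 1) AS source_class,
  regexp_extract(line, '\\): ((Assigned|Released) container)', 1) AS type,
  -- rebuild the interesting fields as a JSON document
  concat('{"application_id":"application_',
         regexp_extract(line, 'container_(\\d+_\\d+)_', 1),
         '","container_id":"',
         regexp_extract(line, '(container_\\S+) of capacity', 1),
         '","memory":', regexp_extract(line, '<memory:(\\d+)', 1),
         ',"cores":', regexp_extract(line, 'vCores:(\\d+)>', 1),
         ',"host":"', regexp_extract(line, 'on host (\\S+),', 1),
         '"}') AS payload
FROM yarn_rm_log_raw
WHERE line RLIKE 'SchedulerNode.*(Assigned|Released) container';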

In my case it contains the following sample data:

ts                       log_level  source_class   type                payload
2019-09-08 17:00:00,798  INFO       SchedulerNode  Assigned container  {"application_id": … }
2019-09-08 17:00:02,798  INFO       SchedulerNode  Released container  {"application_id": … }

The most interesting part is the payload column:

{
  "application_id" : "application_1546747261569_328693", 
  "container_id" : "container_1546747261569_328693_01_000002", 
  "memory" : 4096, 
  "cores" : 1, 
  "host" : "host_ip:8041", 
  "containers_used" : 1, 
  "memory_used" : 4096, 
  "cores_used" : 1, 
  "memory_avail" : 12288, 
  "cores_avail" : 7
}

Now it is quite straightforward to get the start and end timestamps for each container:

select 
  get_json_object(payload, '$.application_id') app_id,
  get_json_object(payload, '$.container_id') container_id,
  min(case when type = 'Assigned container' then ts end) assigned_ts,
  max(case when type = 'Released container' then ts end) released_ts,
  -- cast to INT so MAX compares numbers, not strings
  max(cast(get_json_object(payload, '$.memory') as int)) memory,
  max(cast(get_json_object(payload, '$.cores') as int)) cores,
  max(get_json_object(payload, '$.host')) host
from yarn_rm_log_events
where source_class = 'SchedulerNode'
group by get_json_object(payload, '$.application_id'), get_json_object(payload, '$.container_id');

Then you can write a script that iterates over every second from 00:00:00 to 23:59:59 and finds out which containers were running in that second and how much memory and CPU they consumed.
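You do not necessarily need an external script. Here is a minimal HiveQL sketch, assuming the output of the previous query is saved into a hypothetical container_lifetimes table; it expands every container lifetime into one row per second with posexplode and then aggregates:

with per_second as (
  select
    unix_timestamp(substr(assigned_ts, 1, 19), 'yyyy-MM-dd HH:mm:ss') + pos as sec,
    app_id,
    memory,
    cores
  from container_lifetimes
  -- space(n) produces n spaces; split by ' ' gives n + 1 elements,
  -- so pos runs from 0 to the container lifetime in seconds, inclusive
  lateral view posexplode(split(space(
    cast(unix_timestamp(substr(released_ts, 1, 19), 'yyyy-MM-dd HH:mm:ss')
       - unix_timestamp(substr(assigned_ts, 1, 19), 'yyyy-MM-dd HH:mm:ss') as int)
  ), ' ')) t as pos, val
  where released_ts is not null  -- still-running containers need separate handling
)
select
  from_unixtime(sec, 'HH:mm:ss') as second,
  count(*) as num_containers,
  sum(memory) as memory_used,
  sum(cores) as cores_used,
  count(distinct app_id) as num_apps
from per_second
group by from_unixtime(sec, 'HH:mm:ss')
order by second;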

Here is my sample data:

second    num of containers  memory used (MB)  cores used  num of distinct applications
00:00:00  1                  1024              1           1
00:00:01  2                  2048              2           2
00:00:02  4                  10240             4           2

The result has 86,400 rows, one for each second in 24 hours. So now you know the exact per-second utilization of your cluster.

Utilization of Fixed-Sized Clusters – No Auto-scaling

Using per-second utilization metrics you can calculate daily and hourly cluster utilization.

Let’s see how we can use this data to calculate the daily utilization of fixed-sized clusters, i.e. clusters with a constant number of nodes.

                     sum(memory used)
Memory utilization = ----------------------------- * 100%
                     total cluster memory * 86,400 

The same formula applies to CPU cores:

                  sum(cores used)
CPU utilization = ---------------------------- * 100%
                  total cluster cores * 86,400 
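For example, here is a minimal HiveQL sketch, assuming the per-second rows are stored in a hypothetical cluster_per_second table, and a hypothetical 10-node cluster with 30,720 MB of memory and 16 vCores per node:

select
  -- total cluster memory = 10 nodes * 30,720 MB each
  sum(memory_used) / (10 * 30720 * 86400.0) * 100 as memory_utilization_pct,
  -- total cluster cores = 10 nodes * 16 vCores each
  sum(cores_used) / (10 * 16 * 86400.0) * 100 as cpu_utilization_pct
from cluster_per_second;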

Besides simple averages, you can also compute percentiles and apply various time-series analysis models and techniques.
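For example, Hive's built-in percentile function can show what the busiest seconds look like (again using the hypothetical cluster_per_second table):

-- median and 95th percentile of per-second memory usage
select percentile(memory_used, array(0.5, 0.95)) as memory_used_pcts
from cluster_per_second;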

The YARN Resource Manager logs can be very helpful for analyzing cluster utilization as well as troubleshooting various cluster and application issues.