Hadoop YARN Cluster Idle Time – Large-Scale Data Engineering in Cloud

In the previous article, "Calculating Utilization of Cluster using Resource Manager Logs", I showed how to estimate per-second utilization for a Hadoop cluster.

This information can be used to calculate idle time statistics for a cluster, i.e. the time when no containers are running.

Suppose we have time series data showing how many containers were running during each second of the day:

second of day  containers_running 
-------------  ------------------
00:00:00       2450
00:00:01       2450
00:00:02       2450
...
23:59:57       1400
23:59:58       1400
23:59:59       1400

We can then find the seconds and time gaps when the cluster did not run any containers, for example, using the following Python script, which reads the file yarn_rm_sec.txt and treats every second missing from it as idle:

import datetime
import pandas as pd 

sec_file = open("yarn_rm_sec.txt")
start_dt = datetime.datetime.strptime("00:00:00", "%H:%M:%S")
secs_all = 0
idle_periods = 0
secs_list = []    

# For every second from 00:00:00 to 23:59:59
while True:
    cur_sec = start_dt.strftime("%H:%M:%S")
    line = sec_file.readline()    
    if line:
       yarn_sec = line.split('\t')[0]
    else:
       yarn_sec = "23:59:59"
       
    start_gap = cur_sec
    secs = 0    
       
    # Count idle seconds until the next second that has running containers
    while cur_sec < yarn_sec:
        start_dt = start_dt + datetime.timedelta(seconds = 1)
        cur_sec = start_dt.strftime("%H:%M:%S")
        if secs == 0:
            idle_periods += 1
        secs += 1
        secs_all += 1
        
    if secs > 0:
        secs_list.append(secs)    

    end_gap = cur_sec

    # Check if a gap exists 
    if start_gap != end_gap:
        print(start_gap + "\t" + end_gap + "\t" + str(secs) + "\t" + str(secs_all))    

    if cur_sec == "23:59:59":
        break
    start_dt = start_dt + datetime.timedelta(seconds = 1)
    
df = pd.DataFrame({'secs' : secs_list}) 
    
print("Stats:")
print("  Idle periods: " + str(idle_periods))
print("  Total idle seconds: " + str(secs_all))
print("  Total idle: " + str(secs_all*100//86400) + "%")
print("  Min idle time: " + str(df['secs'].min()))
print("  Max idle time: " + str(df['secs'].max()))
print("  Median idle time: " + str(df['secs'].median()))
print("  Top 10 max idle times: " + df['secs'].nlargest(10).to_string(index=False).replace('\n', ''))
print("  Top 10 min idle times: " + df['secs'].nsmallest(10).to_string(index=False).replace('\n', ''))
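
The same gaps can also be found by comparing consecutive busy seconds instead of stepping through the day one second at a time. The following is a minimal alternative sketch, not the script that produced the results in this article. It assumes the input is a collection of "HH:MM:SS" strings for the seconds in which at least one container was running; the names `to_second_of_day`, `to_hms` and `find_idle_gaps` are hypothetical, and a gap that lasts until midnight is reported with an end of 24:00:00.

```python
def to_second_of_day(hms):
    """Convert an "HH:MM:SS" string to the number of seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(second_of_day):
    """Convert seconds since midnight back to an "HH:MM:SS" string."""
    h, rem = divmod(second_of_day, 3600)
    m, s = divmod(rem, 60)
    return "%02d:%02d:%02d" % (h, m, s)


def find_idle_gaps(busy_seconds, day_length=86400):
    """Return a list of (idle_start, idle_end, seconds) tuples.

    idle_end is the first busy second after the gap (exclusive),
    or "24:00:00" when the gap runs until the end of the day.
    """
    busy = sorted({to_second_of_day(s) for s in busy_seconds})
    gaps = []
    prev = -1  # last busy second seen so far; -1 stands for "before midnight"
    for sec in busy + [day_length]:  # day_length acts as an end-of-day sentinel
        if sec - prev > 1:
            gaps.append((to_hms(prev + 1), to_hms(sec), sec - prev - 1))
        prev = sec
    return gaps


# Small in-memory example; for a real file the list could be built with
# [line.split("\t")[0] for line in open("yarn_rm_sec.txt")]
sample = ["00:00:00", "00:00:01", "00:00:05", "23:59:59"]
for start, end, secs in find_idle_gaps(sample):
    print(start + "\t" + end + "\t" + str(secs))
```

Because only the busy seconds are visited, the work depends on the number of lines in the file rather than on the 86,400 seconds of the day.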

In my case, I got the following result:

idle_start      idle_end        seconds   cumulative_idle_seconds
----------      --------        -------   -----------------------
00:12:48	00:14:22	94	94
00:14:49	00:20:23	334	428
00:22:42	00:25:20	158	586
00:26:08	00:32:45	397	983
00:33:38	00:33:41	3	986
00:33:52	00:40:28	396	1382
09:22:58	09:25:34	156	1538
09:34:28	09:40:35	367	1905
09:42:38	09:47:15	277	2182
09:47:42	09:52:24	282	2464
09:54:14	09:54:25	11	2475
10:00:11	10:00:41	30	2505
10:05:49	10:10:35	286	2791
10:11:53	10:11:59	6	2797
10:12:26	10:20:58	512	3309
10:22:51	10:25:35	164	3473
10:32:40	10:32:58	18	3491
10:33:50	10:33:53	3	3494
10:34:04	10:40:36	392	3886
10:42:49	10:46:07	198	4084
10:46:35	10:51:16	281	4365
10:53:07	10:53:18	11	4376
10:59:58	11:00:23	25	4401
11:17:13	11:20:19	186	4587
11:39:01	11:40:21	80	4667
11:42:43	11:45:50	187	4854
11:46:21	11:50:56	275	5129
11:52:46	11:52:58	12	5141
12:00:04	12:00:18	14	5155
12:40:15	12:40:20	5	5160
12:42:19	12:45:53	214	5374
12:46:22	12:51:05	283	5657
12:52:56	12:53:07	11	5668
12:59:38	13:00:26	48	5716
13:08:49	13:10:18	89	5805
13:12:03	13:20:22	499	6304
13:38:48	13:40:21	93	6397
13:42:58	13:45:54	176	6573
13:46:23	13:51:06	283	6856
13:52:50	13:53:03	13	6869
14:08:00	14:08:49	49	6918
14:09:20	14:10:20	60	6978
14:11:38	14:20:22	524	7502
14:22:28	14:25:20	172	7674
14:36:45	14:40:22	217	7891
14:52:56	14:53:08	12	7903
15:22:54	15:25:25	151	8054
15:33:55	15:40:29	394	8448
16:08:22	16:09:35	73	8521
16:10:06	16:10:27	21	8542
16:11:40	16:20:29	529	9071
16:22:16	16:25:27	191	9262
17:09:11	17:10:29	78	9340
17:13:12	17:20:31	439	9779
17:22:25	17:25:28	183	9962
17:26:13	17:32:53	400	10362
17:33:46	17:33:48	2	10364
17:33:59	17:40:35	396	10760
18:23:23	18:23:25	2	10762
18:23:26	18:25:31	125	10887
18:26:19	18:32:56	397	11284
18:33:49	18:33:51	2	11286
18:34:02	18:40:37	395	11681
19:12:21	19:20:37	496	12177
19:22:35	19:25:34	179	12356
19:27:21	19:33:00	339	12695
19:33:52	19:33:55	3	12698
19:34:06	19:40:39	393	13091
20:15:48	20:20:39	291	13382
20:23:01	20:25:35	154	13536
20:26:21	20:32:59	398	13934
20:34:13	20:40:39	386	14320
21:04:22	21:23:01	1119	15439
21:45:32	21:45:33	1	15440
21:45:35	21:46:51	76	15516
21:47:06	21:51:45	279	15795
21:53:33	21:53:45	12	15807
22:30:40	22:32:40	120	15927
22:33:31	22:33:34	3	15930
22:33:45	22:40:21	396	16326
23:12:22	23:15:43	201	16527
23:16:34	23:20:20	226	16753
23:22:37	23:25:17	160	16913
23:26:04	23:32:42	398	17311
23:34:26	23:40:23	357	17668

Stats:
  Idle periods: 85
  Total idle seconds: 17668
  Total idle: 20%
  Min idle time: 1
  Max idle time: 1119
  Median idle time: 179.0
  Top 10 max idle times:  1119  529  524  512  499  496  439  400  398  398
  Top 10 min idle times:  1 2 2 2 3 3 3 3 5 6

The cluster was idle for 17,668 seconds (17,668/86,400 ≈ 20% of the day), but this idle time is fragmented.

There were 85 separate idle periods, most of them short, and the longest one lasted just under 19 minutes (1,119 seconds):

21:04:22	21:23:01	1119	15439

As you can see, there is not much opportunity to turn the cluster off and save 20%, because the cluster runs some workload throughout all 24 hours of the day.
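
One way to quantify this is to count only the idle time that falls in gaps long enough to justify a shutdown. The sketch below is illustrative: the function name `idle_time_in_long_gaps` and the 10 and 15 minute thresholds are assumptions, not values from the analysis. It uses the ten longest idle periods from the stats above, which is sufficient for any threshold above 398 seconds because every other gap is shorter.

```python
def idle_time_in_long_gaps(durations, min_gap):
    """Sum the idle seconds that fall in gaps of at least min_gap seconds."""
    return sum(d for d in durations if d >= min_gap)


# Ten longest idle periods from the stats above, in seconds
top_gaps = [1119, 529, 524, 512, 499, 496, 439, 400, 398, 398]

# Hypothetical thresholds: a shutdown only pays off for gaps of 10 or 15 minutes
for min_gap in (600, 900):
    secs = idle_time_in_long_gaps(top_gaps, min_gap)
    pct = secs * 100 / 86400
    print("gaps >= %d s: %d idle seconds (%.1f%% of the day)" % (min_gap, secs, pct))
```

With either threshold only the single 1,119-second gap qualifies, which is far less than the 17,668 idle seconds in total.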

However, auto scaling can still help reduce the cluster cost, and I will investigate this topic in future articles.