In the previous article, Calculating Utilization of Cluster using Resource Manager Logs, I showed how to estimate per-second utilization for a Hadoop cluster.
This information can also be used to calculate idle time statistics for a cluster, i.e. the time when no containers are running.
Suppose we have time series data showing how many containers were running during each second of the day:
second of day  containers_running
-------------  ------------------
00:00:00       2450
00:00:01       2450
00:00:02       2450
...
23:59:57       1400
23:59:58       1400
23:59:59       1400
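We can then find the seconds and time gaps when the cluster did not run any containers, for example, using the following Python script. It reads a tab-separated file, yarn_rm_sec.txt, whose first column is the second of the day, and treats every second that is missing from the file as idle: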
import datetime
import pandas as pd

start_dt = datetime.datetime.strptime("00:00:00", "%H:%M:%S")
secs_all = 0        # total idle seconds seen so far
idle_periods = 0    # number of separate idle gaps
secs_list = []      # length of every idle gap, in seconds

with open("yarn_rm_sec.txt") as sec_file:
    # For every second from 00:00:00 to 23:59:59
    while True:
        cur_sec = start_dt.strftime("%H:%M:%S")
        line = sec_file.readline()
        if line:
            yarn_sec = line.split('\t')[0]
        else:
            # End of file: treat the rest of the day as idle
            yarn_sec = "23:59:59"
        start_gap = cur_sec
        secs = 0
        # If no containers were running this second
        while cur_sec < yarn_sec:
            start_dt = start_dt + datetime.timedelta(seconds=1)
            cur_sec = start_dt.strftime("%H:%M:%S")
            if secs == 0:
                idle_periods += 1
            secs += 1
            secs_all += 1
        if secs > 0:
            secs_list.append(secs)
        end_gap = cur_sec
        # Check if a gap exists ("!=" replaces the Python 2 only "<>" operator)
        if start_gap != end_gap:
            print(start_gap + "\t" + end_gap + "\t" + str(secs) + "\t" + str(secs_all))
        if cur_sec == "23:59:59":
            break
        start_dt = start_dt + datetime.timedelta(seconds=1)

df = pd.DataFrame({'secs': secs_list})

def join_values(series):
    # Print the values on one line, separated by spaces
    return " ".join(str(v) for v in series)

print("Stats:")
print("  Idle periods: " + str(idle_periods))
print("  Total idle seconds: " + str(secs_all))
# Integer division keeps the whole-percent output under Python 3
print("  Total idle: " + str(secs_all * 100 // 86400) + "%")
print("  Min idle time: " + str(df['secs'].min()))
print("  Max idle time: " + str(df['secs'].max()))
print("  Median idle time: " + str(df['secs'].median()))
print("  Top 10 max idle times: " + join_values(df['secs'].nlargest(10)))
print("  Top 10 min idle times: " + join_values(df['secs'].nsmallest(10)))
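The gap detection itself can also be written as a small function that is easier to test in isolation. The following is only a sketch under stated assumptions, not the script used for the results in this article: the function names (`to_seconds`, `to_hms`, `find_idle_gaps`) and the sample data are hypothetical, the input is assumed to be a list of "HH:MM:SS" strings for the seconds in which containers ran, and, unlike the script above, a trailing gap is counted through the end of the day and reported as ending at 24:00:00.

```python
DAY_SECONDS = 24 * 60 * 60  # 86,400 seconds in a day


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total):
    """Convert seconds since midnight back to 'HH:MM:SS'."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def find_idle_gaps(busy_seconds):
    """Return (idle_start, idle_end, idle_secs) for each run of missing seconds.

    idle_end is the first busy second after the gap, as in the script's output.
    """
    gaps = []
    expected = 0  # next second that would be busy if there were no gap
    for sec in sorted({to_seconds(s) for s in busy_seconds}):
        if sec > expected:
            gaps.append((to_hms(expected), to_hms(sec), sec - expected))
        expected = sec + 1
    if expected < DAY_SECONDS:
        # Trailing gap: idle until midnight, reported as ending at 24:00:00
        gaps.append((to_hms(expected), to_hms(DAY_SECONDS), DAY_SECONDS - expected))
    return gaps


# Hypothetical sample: containers ran only during these four seconds
sample = ["00:00:00", "00:00:01", "00:00:05", "00:00:06"]
gaps = find_idle_gaps(sample)
for start, end, secs in gaps:
    print(start, end, secs, sep="\t")
```

Working with integer seconds instead of formatted strings avoids stepping through the day one second at a time, and duplicate or unsorted input lines do not affect the result.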
In my case, I got the following result (each row is one idle period, and the last column is the running total of idle seconds):
idle_start  idle_end  seconds  total_seconds_per_day
----------  --------  -------  ---------------------
00:12:48    00:14:22  94       94
00:14:49    00:20:23  334      428
00:22:42    00:25:20  158      586
00:26:08    00:32:45  397      983
00:33:38    00:33:41  3        986
00:33:52    00:40:28  396      1382
09:22:58    09:25:34  156      1538
09:34:28    09:40:35  367      1905
09:42:38    09:47:15  277      2182
09:47:42    09:52:24  282      2464
09:54:14    09:54:25  11       2475
10:00:11    10:00:41  30       2505
10:05:49    10:10:35  286      2791
10:11:53    10:11:59  6        2797
10:12:26    10:20:58  512      3309
10:22:51    10:25:35  164      3473
10:32:40    10:32:58  18       3491
10:33:50    10:33:53  3        3494
10:34:04    10:40:36  392      3886
10:42:49    10:46:07  198      4084
10:46:35    10:51:16  281      4365
10:53:07    10:53:18  11       4376
10:59:58    11:00:23  25       4401
11:17:13    11:20:19  186      4587
11:39:01    11:40:21  80       4667
11:42:43    11:45:50  187      4854
11:46:21    11:50:56  275      5129
11:52:46    11:52:58  12       5141
12:00:04    12:00:18  14       5155
12:40:15    12:40:20  5        5160
12:42:19    12:45:53  214      5374
12:46:22    12:51:05  283      5657
12:52:56    12:53:07  11       5668
12:59:38    13:00:26  48       5716
13:08:49    13:10:18  89       5805
13:12:03    13:20:22  499      6304
13:38:48    13:40:21  93       6397
13:42:58    13:45:54  176      6573
13:46:23    13:51:06  283      6856
13:52:50    13:53:03  13       6869
14:08:00    14:08:49  49       6918
14:09:20    14:10:20  60       6978
14:11:38    14:20:22  524      7502
14:22:28    14:25:20  172      7674
14:36:45    14:40:22  217      7891
14:52:56    14:53:08  12       7903
15:22:54    15:25:25  151      8054
15:33:55    15:40:29  394      8448
16:08:22    16:09:35  73       8521
16:10:06    16:10:27  21       8542
16:11:40    16:20:29  529      9071
16:22:16    16:25:27  191      9262
17:09:11    17:10:29  78       9340
17:13:12    17:20:31  439      9779
17:22:25    17:25:28  183      9962
17:26:13    17:32:53  400      10362
17:33:46    17:33:48  2        10364
17:33:59    17:40:35  396      10760
18:23:23    18:23:25  2        10762
18:23:26    18:25:31  125      10887
18:26:19    18:32:56  397      11284
18:33:49    18:33:51  2        11286
18:34:02    18:40:37  395      11681
19:12:21    19:20:37  496      12177
19:22:35    19:25:34  179      12356
19:27:21    19:33:00  339      12695
19:33:52    19:33:55  3        12698
19:34:06    19:40:39  393      13091
20:15:48    20:20:39  291      13382
20:23:01    20:25:35  154      13536
20:26:21    20:32:59  398      13934
20:34:13    20:40:39  386      14320
21:04:22    21:23:01  1119     15439
21:45:32    21:45:33  1        15440
21:45:35    21:46:51  76       15516
21:47:06    21:51:45  279      15795
21:53:33    21:53:45  12       15807
22:30:40    22:32:40  120      15927
22:33:31    22:33:34  3        15930
22:33:45    22:40:21  396      16326
23:12:22    23:15:43  201      16527
23:16:34    23:20:20  226      16753
23:22:37    23:25:17  160      16913
23:26:04    23:32:42  398      17311
23:34:26    23:40:23  357      17668

Stats:
  Idle periods: 85
  Total idle seconds: 17668
  Total idle: 20%
  Min idle time: 1
  Max idle time: 1119
  Median idle time: 179.0
  Top 10 max idle times: 1119 529 524 512 499 496 439 400 398 398
  Top 10 min idle times: 1 2 2 2 3 3 3 3 5 6
The cluster was idle for 17,668 seconds (17,668/86,400 ≈ 20% of the day), but…
There were 85 short periods when the cluster was idle, and the longest one lasted only about 19 minutes (1,119 seconds):
21:04:22 21:23:01 1119 15439
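To put these gaps in perspective, the sketch below counts how many idle periods are long enough to be worth stopping the cluster for. The 10-minute threshold (`min_stop_secs`) and the helper `gaps_at_least` are hypothetical and only illustrate the idea. The durations are the "Top 10 max idle times" from the stats above. Because the tenth largest gap is 398 seconds, any gap of 600 seconds or more must already be in that list.

```python
from datetime import timedelta

DAY_SECONDS = 24 * 60 * 60
TOTAL_IDLE_SECS = 17668  # "Total idle seconds" from the stats above

# "Top 10 max idle times" from the stats above, in seconds
top_idle_secs = [1119, 529, 524, 512, 499, 496, 439, 400, 398, 398]


def gaps_at_least(durations, min_secs):
    """Return the idle durations that last at least min_secs seconds."""
    return [d for d in durations if d >= min_secs]


min_stop_secs = 600  # hypothetical: shortest gap worth stopping the cluster for
long_gaps = gaps_at_least(top_idle_secs, min_stop_secs)

idle_pct = TOTAL_IDLE_SECS * 100 / DAY_SECONDS
long_pct = sum(long_gaps) * 100 / DAY_SECONDS

print("Total idle:", timedelta(seconds=TOTAL_IDLE_SECS), f"({idle_pct:.2f}% of the day)")
print("Gaps >= 10 min:", len(long_gaps), f"({long_pct:.2f}% of the day)")
```

With these numbers, the total idle time is 4 hours 54 minutes 28 seconds (20.45% of the day), but only one gap reaches the assumed 10-minute threshold, and it covers about 1.3% of the day.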
You can see that there is not much opportunity to turn the cluster off to save 20%, because the idle time is spread across many short gaps and the cluster runs some workload during every hour of the day.
But auto scaling can still help reduce the cluster cost, and I will investigate this topic in future articles.