S3 Monitoring Step #2 – Read Operations – Large-Scale Data Engineering in Cloud

After you get the first impression about your S3 buckets by looking at their size, number of objects and daily growth rate (see Step #1), it is time to investigate I/O operations in detail.

Let’s start with read operations. Now we need to use S3 Access Logs to get the detailed information about all performed S3 REST.GET.OBJECT operations.

Number of Read Operations and Volume of Data Read

I have a Hive table s3_access_logs created on top of S3 logs, and I can query it as follows:

select event_dt, count(*) all_requests, count(distinct key) dist_keys, 
  count(distinct remote_ip) dist_hosts,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read,
  sum(bytes_sent)/(1024*1024*1024)/86400 avg_gigabytes_sec 
from s3_access_logs 
where event_dt between '2018-10-01' and '2018-10-03' and operation = 'REST.GET.OBJECT'
group by event_dt;

And here is my sample output:

event_dt	all_requests	dist_keys	dist_hosts	terabytes_read	avg_gigabytes_sec
2018-10-01	27,880,357	8,059,021	4,165	1,838.7	21.3
2018-10-02	22,093,902	8,073,974	3,969	1,784.3	20.6
2018-10-03	35,278,179	11,512,215	4,361	2,088.0	24.2

In this sample, all_requests shows the total number of REST.GET.OBJECT requests (there can be multiple requests for reading a single file, and this is especially true for reading ORC/Parquet files), dist_keys shows the number of unique objects (files) requested, and dist_host tells us about unique hosts that made requests (mostly EC2 instances from various Hadoop clusters).

Finally terabytes_read shows how much data were read from S3 daily and avg_gigabytes_sec tells us about the average transfer rate per second (!) for read operations in gigabytes for every day.

Read of Unique Objects

Knowing the number of unique objects (files) read per day, let’s see their total size:

select event_dt, count(*) dist_keys, 
  sum(total_size)/(cast(1024 as bigint)*1024*1024*1024) dist_keys_terabytes
from (
  select event_dt, key, max(total_size) total_size
  from s3_access_logs 
  where event_dt between '2018-10-01' and '2018-10-03' and operation='REST.GET.OBJECT'
  group by event_dt, key) t
group by event_dt;

The sample results is as follows:

event_dt	dist_keys	dist_keys_terabytes
2018-10-01	8,059,021	424.2
2018-10-02	8,073,974	379.0
2018-10-03	11,512,215	523.9

Here dist_keys_terabytes show the total size of the unique objects that were read daily. This size is a quite good approximation of unique data volume that is required to read.

Of course, it is not exact as, for example, ORC files can be read partially. I also assume that objects do not change, and it is often true for data warehouse workloads when typically new data are only added, and the existing data are not changed.

So we see that out of 1.8 PB data only 424.2 TB represent the unique content, this means that the same data were read multiple times from S3. It is a good question whether we can effectively cache it somehow.

Now let’s see if unique objects change over time:

select count(*) keys, 
  sum(total_size)/(cast(1024 as bigint)*1024*1024*1024) dist_key_terabytes
from (
  select  key, max(total_size) total_size
  from s3_access_logs 
  where event_dt between '2018-10-01' and '2018-10-03' and operation='REST.GET.OBJECT'
  group by key) t;

The output:

dist_keys	dist_keys_terabytes
15,813,552	966.5

You can see that for 3 days of data objects that are read from S3 are quite unique, it is not the same data read every day.

Object Size

It is also helpful to review read operations based on the object size bins:

select event_dt, size_type, count(*) keys, 
  count(distinct key) dist_keys, 
  sum(total_time_ms/1000) time_spent,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
  select 
    event_dt, key,
    case 
      when total_size <= 1024*1024 then '<= 1 MB'
      when total_size <= 30*1024*1024 then '<= 30 MB'
      when total_size <= 100*1024*1024 then '<= 100 MB'
      else '> 100 MB'
    end size_type,
    bytes_sent,
    total_time_ms
  from s3_access_logs 
  where event_dt = '2018-10-01' and operation='REST.GET.OBJECT') t
group by event_dt, size_type;

Sample output:

event_dt	size_type	keys	dist_keys	time_spent	terabytes_read
2018-10-01	<= 1 MB	10,294,338	4,076,071	553,798	2.6
2018-10-01	<= 30 MB	8,080,147	2,546,482	10,765,912	49.0
2018-10-01	<= 100 MB	1,287,447	443,582	19,909,038	72.1
2018-10-01	> 100 MB	8,218,425	992,914	474,555,730	1,714.9

Here we can see that most read operations are performed for large files and this is typical for data warehouse workloads.