After you get the first impression about your S3 buckets by looking at their size, number of objects and daily growth rate (see Step #1), it is time to investigate I/O operations in detail.
Let’s start with read operations. Now we need to use S3 Access Logs to get the detailed information about all performed S3
Number of Read Operations and Volume of Data Read
I have a Hive table
s3_access_logs created on top of S3 logs, and I can query it as follows:
select event_dt, count(*) all_requests, count(distinct key) dist_keys, count(distinct remote_ip) dist_hosts, sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read, sum(bytes_sent)/(1024*1024*1024)/86400 avg_gigabytes_sec from s3_access_logs where event_dt between '2018-10-01' and '2018-10-03' and operation = 'REST.GET.OBJECT' group by event_dt;
And here is my sample output:
In this sample,
all_requests shows the total number of
REST.GET.OBJECT requests (there can be multiple requests for reading a single file, and this is especially true for reading ORC/Parquet files),
dist_keys shows the number of unique objects (files) requested, and
dist_host tells us about unique hosts that made requests (mostly EC2 instances from various Hadoop clusters).
terabytes_read shows how much data were read from S3 daily and
avg_gigabytes_sec tells us about the average transfer rate per second (!) for read operations in gigabytes for every day.
Read of Unique Objects
Knowing the number of unique objects (files) read per day, let’s see their total size:
select event_dt, count(*) dist_keys, sum(total_size)/(cast(1024 as bigint)*1024*1024*1024) dist_keys_terabytes from ( select event_dt, key, max(total_size) total_size from s3_access_logs where event_dt between '2018-10-01' and '2018-10-03' and operation='REST.GET.OBJECT' group by event_dt, key) t group by event_dt;
The sample results is as follows:
dist_keys_terabytes show the total size of the unique objects that were read daily. This size is a quite good approximation of unique data volume that is required to read.
Of course, it is not exact as, for example, ORC files can be read partially. I also assume that objects do not change, and it is often true for data warehouse workloads when typically new data are only added, and the existing data are not changed.
So we see that out of 1.8 PB data only 424.2 TB represent the unique content, this means that the same data were read multiple times from S3. It is a good question whether we can effectively cache it somehow.
Now let’s see if unique objects change over time:
select count(*) keys, sum(total_size)/(cast(1024 as bigint)*1024*1024*1024) dist_key_terabytes from ( select key, max(total_size) total_size from s3_access_logs where event_dt between '2018-10-01' and '2018-10-03' and operation='REST.GET.OBJECT' group by key) t;
You can see that for 3 days of data objects that are read from S3 are quite unique, it is not the same data read every day.
It is also helpful to review read operations based on the object size bins:
select event_dt, size_type, count(*) keys, count(distinct key) dist_keys, sum(total_time_ms/1000) time_spent, sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read from ( select event_dt, key, case when total_size <= 1024*1024 then '<= 1 MB' when total_size <= 30*1024*1024 then '<= 30 MB' when total_size <= 100*1024*1024 then '<= 100 MB' else '> 100 MB' end size_type, bytes_sent, total_time_ms from s3_access_logs where event_dt = '2018-10-01' and operation='REST.GET.OBJECT') t group by event_dt, size_type;
|2018-10-01||<= 1 MB||10,294,338||4,076,071||553,798||2.6|
|2018-10-01||<= 30 MB||8,080,147||2,546,482||10,765,912||49.0|
|2018-10-01||<= 100 MB||1,287,447||443,582||19,909,038||72.1|
|2018-10-01||> 100 MB||8,218,425||992,914||474,555,730||1,714.9|
Here we can see that most read operations are performed for large files and this is typical for data warehouse workloads.