After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. Analyzing the object keys you can easily summarize information about compressed files such as .gz
files.
Later I will use the Hive metadata information to define whether files named like 00000_0
are uncompressed text or ORC files.
select type, count(*) keys, count(distinct key) dist_keys,
sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec,
sum(total_time_ms/1000) time_spent,
sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select
key,
case
when key like '%.gz' then 'Compressed .gz'
else 'Other'
end type,
bytes_sent,
total_time_ms
from s3_access_logs
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type;
Here is my sample output:
type |
keys |
dist_keys |
rate_mb_sec |
time_spent |
terabytes_read |
Compressed .gz |
21,535,003 |
7,411,981 |
3.8 |
504,318,631 |
1,812.8 |
Other |
6,345,354 |
647,040 |
18.5 |
1,465,848 |
25.9 |
File Types and Object Size Bins
Now let’s see the distribution of file types for each size bin:
select type, size_type, count(*) keys, count(distinct key) dist_keys,
sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec,
sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select
key,
case
when key like '%.gz' then 'Compressed .gz'
else 'Other'
end type,
case
when total_size <= 1024*1024 then '<= 1 MB'
when total_size <= 30*1024*1024 then '<= 30 MB'
when total_size <= 100*1024*1024 then '<= 100 MB'
else '> 100 MB'
end size_type,
bytes_sent,
total_time_ms
from s3_access_logs
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type, size_type;
Sample output:
type |
size_type |
keys |
dist_keys |
rate_mb_sec |
terabytes_read |
Compressed .gz |
<= 1 MB |
7,759,230 |
3,579,785 |
5.2 |
2.4 |
Compressed .gz |
<= 30 MB |
6,927,405 |
2,456,010 |
4.6 |
47.3 |
Compressed .gz |
<= 100 MB |
1,136,926 |
436,463 |
3.7 |
71.1 |
Compressed .gz |
> 100 MB |
5,711,442 |
939,723 |
3.7 |
1,691.9 |
Other |
<= 1 MB |
2,535,108 |
496,286 |
3.2 |
0.2 |
Other |
<= 30 MB |
1,152,742 |
90,472 |
22.7 |
1.7 |
Other |
<= 100 MB |
150,521 |
7,119 |
14.7 |
1.0 |
Other |
> 100 MB |
2,506,983 |
53,191 |
19.4 |
23.0 |
See also, S3 Monitoring Step #2 – Read Operations.