After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. Analyzing the object keys you can easily summarize information about compressed files such as .gz
files.
Later I will use the Hive metadata information to define whether files named like 00000_0
are uncompressed text or ORC files.
select type, count(*) keys, count(distinct key) dist_keys, sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, sum(total_time_ms/1000) time_spent, sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read from ( select key, case when key like '%.gz' then 'Compressed .gz' else 'Other' end type, bytes_sent, total_time_ms from s3_access_logs where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t group by type;
Here is my sample output:
type | keys | dist_keys | rate_mb_sec | time_spent | terabytes_read |
---|---|---|---|---|---|
Compressed .gz | 21,535,003 | 7,411,981 | 3.8 | 504,318,631 | 1,812.8 |
Other | 6,345,354 | 647,040 | 18.5 | 1,465,848 | 25.9 |
File Types and Object Size Bins
Now let’s see the distribution of file types for each size bin:
select type, size_type, count(*) keys, count(distinct key) dist_keys, sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read from ( select key, case when key like '%.gz' then 'Compressed .gz' else 'Other' end type, case when total_size <= 1024*1024 then '<= 1 MB' when total_size <= 30*1024*1024 then '<= 30 MB' when total_size <= 100*1024*1024 then '<= 100 MB' else '> 100 MB' end size_type, bytes_sent, total_time_ms from s3_access_logs where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t group by type, size_type;
Sample output:
type | size_type | keys | dist_keys | rate_mb_sec | terabytes_read |
---|---|---|---|---|---|
Compressed .gz | <= 1 MB | 7,759,230 | 3,579,785 | 5.2 | 2.4 |
Compressed .gz | <= 30 MB | 6,927,405 | 2,456,010 | 4.6 | 47.3 |
Compressed .gz | <= 100 MB | 1,136,926 | 436,463 | 3.7 | 71.1 |
Compressed .gz | > 100 MB | 5,711,442 | 939,723 | 3.7 | 1,691.9 |
Other | <= 1 MB | 2,535,108 | 496,286 | 3.2 | 0.2 |
Other | <= 30 MB | 1,152,742 | 90,472 | 22.7 | 1.7 |
Other | <= 100 MB | 150,521 | 7,119 | 14.7 | 1.0 |
Other | > 100 MB | 2,506,983 | 53,191 | 19.4 | 23.0 |
See also, S3 Monitoring Step #2 – Read Operations.