Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

S3 Monitoring Step #3 – Read Operations and File Types

After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. Analyzing the object keys you can easily summarize information about compressed files such as .gz files.

Later I will use the Hive metadata information to define whether files named like 00000_0 are uncompressed text or ORC files.

select type, count(*) keys, count(distinct key) dist_keys, 
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
  sum(total_time_ms/1000) time_spent,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select 
  key,
  case 
    when key like '%.gz' then 'Compressed .gz'
    else 'Other'
  end type,
  bytes_sent,
  total_time_ms
from s3_access_logs 
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type;

Here is my sample output:

type keys dist_keys rate_mb_sec time_spent terabytes_read
Compressed .gz 21,535,003 7,411,981 3.8 504,318,631 1,812.8
Other 6,345,354 647,040 18.5 1,465,848 25.9

File Types and Object Size Bins

Now let’s see the distribution of file types for each size bin:

select type, size_type, count(*) keys, count(distinct key) dist_keys, 
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select 
  key,
  case 
    when key like '%.gz' then 'Compressed .gz'
    else 'Other'
  end type,
  case 
    when total_size <= 1024*1024 then '<= 1 MB'
    when total_size <= 30*1024*1024 then '<= 30 MB'
    when total_size <= 100*1024*1024 then '<= 100 MB'
    else '> 100 MB'
  end size_type,
  bytes_sent,
  total_time_ms
from s3_access_logs 
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type, size_type;

Sample output:

type size_type keys dist_keys rate_mb_sec terabytes_read
Compressed .gz <= 1 MB 7,759,230 3,579,785 5.2 2.4
Compressed .gz <= 30 MB 6,927,405 2,456,010 4.6 47.3
Compressed .gz <= 100 MB 1,136,926 436,463 3.7 71.1
Compressed .gz > 100 MB 5,711,442 939,723 3.7 1,691.9
Other <= 1 MB 2,535,108 496,286 3.2 0.2
Other <= 30 MB 1,152,742 90,472 22.7 1.7
Other <= 100 MB 150,521 7,119 14.7 1.0
Other > 100 MB 2,506,983 53,191 19.4 23.0

See also, S3 Monitoring Step #2 – Read Operations.