Knowing how Hive table storage is organized helps us extract additional information about S3 read operations for each table.
In most cases (and you can easily adapt this to your specific table storage pattern), tables are stored in an S3 bucket under the following key structure:
s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
For example, hourly data for the orders table can be stored as follows:
s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
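Under this layout each level of the key prefix corresponds to a meaningful unit: the first component after the bucket name is the common hive root, the next one is the database, then the table, followed by the partition directories. As a minimal sketch (using the sample key above), this is how the query below decomposes a key into the components it aggregates on:

```sql
-- Strip the bucket prefix and split the rest of the key into path components:
-- p[0] = 'hive', p[1] = 'sales.db', p[2] = 'orders',
-- p[3] = 'created_dt=2018-10-18', p[4] = 'hour=00'
select split(substr('s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/',
                    length('s3://cloudsqale/') + 1), '/');
```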
From the S3 access logs we can take all read operations and aggregate the bytes read for various S3 key prefix combinations using the following SQL:
select * from
(
  select
    path,
    -- convert the total bytes read into terabytes
    sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) size_tb
  from
  (
    -- strip the bucket prefix and split the object key into path components
    select split(substr(key, length('s3://cloudsqale/') + 1), '/') p, bytes_sent
    from s3_access_logs
    where event_dt = '${EVENT_DT}' and operation = 'REST.GET.OBJECT'
  ) t
  -- emit one row per key prefix level: hive, hive/<db>, hive/<db>/<table>, partitions
  lateral view explode(array(
      p[0],
      concat(p[0], '/', p[1]),
      if(size(p) > 3, concat(p[0], '/', p[1], '/', p[2]), null),
      if(size(p) > 4, concat(p[0], '/', p[1], '/', p[2], '/', p[3]), null),
      if(size(p) > 5, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4]), null),
      if(size(p) > 6, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4], '/', p[5]), null),
      if(size(p) > 7, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4], '/', p[5], '/', p[6]), null)
    )) x as path
  where path is not null
  group by path
) t
where size_tb > 0.01  -- keep only prefixes with more than ~0.01 TB read
order by size_tb desc
limit 1000000;
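This assumes an s3_access_logs table populated from the raw S3 server access logs. Your definition will depend on how the logs are parsed; a hypothetical minimal sketch covering only the columns the query relies on could look like this:

```sql
-- Hypothetical minimal definition; column set, layout and location are placeholders.
-- Note that key is expected to hold the full s3:// URI, since the query strips
-- the 's3://cloudsqale/' prefix before splitting it into components.
create external table if not exists s3_access_logs (
  operation  string,   -- S3 operation name, e.g. 'REST.GET.OBJECT'
  key        string,   -- object key, e.g. 's3://cloudsqale/hive/sales.db/orders/...'
  bytes_sent bigint    -- bytes returned to the client for the request
)
partitioned by (event_dt string)               -- daily partition used in the WHERE clause
location 's3://cloudsqale/s3_access_logs/';    -- placeholder location
```

With such a table in place, the aggregation query above produces the prefix-level totals shown below.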
In my case the sample output is as follows:
| Key prefix | Size, TB |
|---|---|
| hive | 1617.5 |
| hive/events.db | 1355.2 |
| hive/events.db/game | 544.8 |
| hive/events.db/raw | 424.9 |
| hive/analytics.db | 195.8 |
| hive/analytics.db/game | 195.8 |
This query can help you identify the most heavily read databases and tables in Hive.
See also: S3 Monitoring Step #3 – Read Operations and File Types.