Knowing how Hive table storage is organized helps us extract additional information about S3 read operations for each table.
In most cases (and you can easily adapt this to your specific table storage pattern), tables are stored in an S3 bucket under the following key structure:
s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
For example, hourly data for the orders table can be stored as follows:
s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
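Under this layout each level of the key prefix corresponds to a meaningful unit: the first component after the bucket name is the common hive root, the next one is the database, then the table, followed by the partition directories. As a minimal sketch (using the sample key above), this is how the query below decomposes a key into the components it aggregates on:

```sql
-- Strip the bucket prefix and split the rest of the key into path components:
-- p[0] = 'hive', p[1] = 'sales.db', p[2] = 'orders',
-- p[3] = 'created_dt=2018-10-18', p[4] = 'hour=00'
select split(substr('s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/',
                    length('s3://cloudsqale/') + 1), '/');
```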
From the S3 access logs we can take all read operations and aggregate the bytes read for various S3 key prefix combinations using the following SQL:
select * from
(
  select
    path,
    -- convert the total bytes read into terabytes
    sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) size_tb
  from
  (
    -- strip the bucket prefix and split the object key into path components
    select split(substr(key, length('s3://cloudsqale/') + 1), '/') p, bytes_sent
    from s3_access_logs
    where event_dt = '${EVENT_DT}' and operation = 'REST.GET.OBJECT'
  ) t
  -- emit one row per key prefix level: hive, hive/<db>, hive/<db>/<table>, partitions
  lateral view explode(array(
      p[0],
      concat(p[0], '/', p[1]),
      if(size(p) > 3, concat(p[0], '/', p[1], '/', p[2]), null),
      if(size(p) > 4, concat(p[0], '/', p[1], '/', p[2], '/', p[3]), null),
      if(size(p) > 5, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4]), null),
      if(size(p) > 6, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4], '/', p[5]), null),
      if(size(p) > 7, concat(p[0], '/', p[1], '/', p[2], '/', p[3], '/', p[4], '/', p[5], '/', p[6]), null)
    )) x as path
  where path is not null
  group by path
) t
where size_tb > 0.01  -- keep only prefixes with more than ~0.01 TB read
order by size_tb desc
limit 1000000;
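This assumes an s3_access_logs table populated from the raw S3 server access logs. Your definition will depend on how the logs are parsed; a hypothetical minimal sketch covering only the columns the query relies on could look like this:

```sql
-- Hypothetical minimal definition; column set, layout and location are placeholders.
-- Note that key is expected to hold the full s3:// URI, since the query strips
-- the 's3://cloudsqale/' prefix before splitting it into components.
create external table if not exists s3_access_logs (
  operation  string,   -- S3 operation name, e.g. 'REST.GET.OBJECT'
  key        string,   -- object key, e.g. 's3://cloudsqale/hive/sales.db/orders/...'
  bytes_sent bigint    -- bytes returned to the client for the request
)
partitioned by (event_dt string)               -- daily partition used in the WHERE clause
location 's3://cloudsqale/s3_access_logs/';    -- placeholder location
```

With such a table in place, the aggregation query above produces the prefix-level totals shown below.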
In my case the sample output is as follows:
| Key prefix | Size, TB |
|---|---|
| hive | 1617.5 |
| hive/events.db | 1355.2 |
| hive/events.db/game | 544.8 |
| hive/events.db/raw | 424.9 |
| hive/analytics.db | 195.8 |
| hive/analytics.db/game | 195.8 |
This query can help you identify the most heavily read databases and tables in Hive.
See also: S3 Monitoring Step #3 – Read Operations and File Types.