Collecting S3 Access Logs – Large-Scale Data Engineering in Cloud

Amazon allows you to enable S3 access logging that you can use to monitor S3 performance: request rate, I/O workload, user and compute node level statistics, service delays and outages, and much more.

S3 log files are quite small, uncompressed text files that in case of intensive S3 usage can be generated almost every second:

...
2018-09-20 16:20:27     323567 2018-09-20-16-20-26-DE17FAE504462084
2018-09-20 16:20:28     598192 2018-09-20-16-20-27-5F17C98DFA22DA31
2018-09-20 16:20:29     618862 2018-09-20-16-20-28-4660E2CBCB0FB2C5
2018-09-20 16:20:32     381675 2018-09-20-16-20-31-16549B7BABDA06AE
2018-09-20 16:20:33     405131 2018-09-20-16-20-32-14AB46312C254397
2018-09-20 16:20:34     587042 2018-09-20-16-20-33-385E799AFCEBAEE3
2018-09-20 16:20:35     358275 2018-09-20-16-20-34-FA52E601A410E529
2018-09-20 16:20:36     604080 2018-09-20-16-20-35-C02066EDF9026EF9
...

So you can have 35K+ files generated per day (and there is no a sub-directory for each day), and if you are going to analyze S3 statistics for a long period of time (weeks, months), the performance of your Hive or Presto queries can be very low.

Additionally there is often a lifecycle rule defined to keep logs only for 1-2 days.

It makes sense to set up an ETL process that parses files, aggregates records from 100-300 log files into a single log file, compresses the resulting file and puts it into a Hive table partition:

CREATE TABLE s3_access_logs (
  datetime_req STRING,
  remote_ip STRING,
  requester STRING,
  operation STRING,
  key STRING,
  http_status STRING,
  error_code STRING,
  bytes_sent BIGINT,
  total_size BIGINT,
  total_time_ms BIGINT,
  turnaround_ms BIGINT,
  user_agent STRING
)
PARTITIONED BY (event_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/s3_access_logs';

With this approach you get the small number of large compressed files, located in partition directories that you can query quite quickly now:

SELECT datetime_req, total_size, total_time_ms, http_status 
FROM s3_access_logs 
WHERE event_dt = '2018-09-20' AND 
  operation = 'REST.GET.OBJECT'

Sample result:

Preserving S3 logs in a compact form for long period of time will help you monitor and troubleshoot S3 performance more efficiently.