AWS, Hive, S3

Hive Table for S3 Access Logs

Amazon S3 can generate a large volume of access logs, and it often makes sense to build an ETL process that parses and combines them and converts the result to Parquet or ORC for better query performance. Still, there is an easy way to analyze the logs right away: create a Hive table directly on top of the raw S3 log directory (a sketch of the Parquet conversion appears at the end of this post).

Here is the DDL I use for the Hive table:

CREATE EXTERNAL TABLE s3_access_logs (
  bucket_owner     STRING,
  bucket           STRING,
  request_datetime STRING,
  remote_ip        STRING,
  requester        STRING,
  request_id       STRING,
  operation        STRING,
  key              STRING,
  request_uri      STRING,
  http_status      STRING,
  error_code       STRING,
  bytes_sent       BIGINT,
  object_size      BIGINT,
  total_time       STRING,
  turnaround_time  STRING,
  referrer         STRING,
  user_agent       STRING,
  version_id       STRING,
  host_id          STRING,
  sigv             STRING,
  cipher_suite     STRING,
  auth_type        STRING,
  endpoint         STRING,
  tls_version      STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\"[^\"]*\") (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\"[^\"]*\") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://cloudsqale/s3_access_logs';

Just set LOCATION to the S3 path that holds the access logs you want to query. Note that the last six columns, host_id through tls_version, are captured by an optional group in the regex: AWS added these fields to the log format later, so older log lines do not contain them and the regex still has to match.
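
With the table in place, the logs can be analyzed with plain SQL. Keep in mind that request_datetime is a STRING holding the bracketed S3 timestamp, so it has to be parsed for date filtering. The query below is only a sketch of the kind of ad-hoc analysis this table enables (the date literal is an arbitrary example):

-- Requests and bytes sent per HTTP status for a single day;
-- request_datetime is parsed using the S3 access log time pattern
SELECT
  http_status,
  count(*)        AS requests,
  sum(bytes_sent) AS bytes
FROM s3_access_logs
WHERE to_date(from_unixtime(unix_timestamp(request_datetime,
        'dd/MMM/yyyy:HH:mm:ss Z'))) = '2019-08-01'
GROUP BY http_status
ORDER BY requests DESC;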
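
And if the ETL route mentioned at the beginning becomes worthwhile, this raw-log table can serve as the source. Here is a minimal sketch of a one-off conversion to Parquet (the target table name is hypothetical):

-- Materialize the raw logs as a Parquet table for faster repeated queries
CREATE TABLE s3_access_logs_parquet
STORED AS PARQUET
AS
SELECT * FROM s3_access_logs;

In production this one-off CTAS would typically become an incremental, partitioned load, but the idea is the same.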