Amazon S3 can generate a lot of access logs, and it makes sense to have an ETL process that parses, combines, and converts them into Parquet or ORC format for better query performance. Still, there is an easy way to analyze the logs using a Hive table created directly on top of the raw S3 log directory.
Here is the DDL I use for the Hive table:
CREATE EXTERNAL TABLE s3_access_logs (
  bucket_owner     STRING,
  bucket           STRING,
  request_datetime STRING,
  remote_ip        STRING,
  requester        STRING,
  request_id       STRING,
  operation        STRING,
  key              STRING,
  request_URI      STRING,
  http_status      STRING,
  error_code       STRING,
  bytes_sent       BIGINT,
  object_size      BIGINT,
  total_time       STRING,
  turnaround_time  STRING,
  referrer         STRING,
  user_agent       STRING,
  version_id       STRING,
  host_id          STRING,
  sigv             STRING,
  cipher_suite     STRING,
  auth_type        STRING,
  endpoint         STRING,
  tls_version      STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\\\".*?\\\") (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\"[^\"]*\") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://cloudsqale/s3_access_logs';
Just set LOCATION to the S3 directory containing the access logs you want to query.
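Once the table is created, you can run ad-hoc queries directly against the raw logs. As a quick sketch (the column names come from the DDL above; the query itself is just an illustrative example), here is how you could aggregate requests and bytes sent per HTTP status code:

-- Requests and bytes sent per HTTP status code
SELECT http_status,
       COUNT(*)        AS requests,
       SUM(bytes_sent) AS total_bytes_sent
FROM s3_access_logs
GROUP BY http_status
ORDER BY requests DESC;

Since RegexSerDe parses each log line at query time, such queries will be slower than against Parquet or ORC data, but for occasional troubleshooting they are usually good enough.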