Storage – Page 2 – Large-Scale Data Engineering in Cloud

Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring Step #2 – Read Operations

October 9, 2018

After you get the first impression about your S3 buckets by looking at their size, number of objects and daily growth rate (see Step #1), it is time to investigate I/O operations in detail.

Let’s start with read operations. Now we need to use S3 Access Logs to get the detailed information about all performed S3 REST.GET.OBJECT operations.

Read More

dmtolpeko
Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring Step #1 – Bucket Size and Number of Objects

October 8, 2018
The first step in Amazon S3 monitoring is to check the current state of your S3 buckets and how fast they grow. You can easily get this information from the CloudWatch Management console, running a AWS CLI command or AWS SDK script.

Bucket Size

Here is an example of AWS CLI command to get the size of a bucket for every day within --start-time and --end-time date range:
```
aws cloudwatch get-metric-statistics \
  --metric-name BucketSizeBytes --namespace AWS/S3 \
  --start-time 2018-10-01T00:00:00Z --end-time 2018-10-08T00:00:00Z \
  --statistics Maximum --unit Bytes --region us-east-1 \
  --dimensions Name=BucketName,Value=cloudsqale Name=StorageType,Value=StandardStorage \
  --period 86400 --query 'Datapoints[*].[Timestamp, Maximum]' \
  --output text | sort  | python cloudwatch_s3_metrics.py
```
Read More

dmtolpeko
Hive, I/O, ORC, S3, Storage

Simple Hive Queries with Predicates – Compressed Text vs ORC Files

October 5, 2018
Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

Compressed Text Files

Consider the following sample table:
```
CREATE TABLE clicks
(
   id STRING, 
   name STRING,
   ... 
   referral_id STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
```
In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.
Read More

dmtolpeko
Amazon, AWS, I/O, Logs, S3, Storage

Collecting S3 Access Logs

September 25, 2018
Amazon allows you to enable S3 access logging that you can use to monitor S3 performance: request rate, I/O workload, user and compute node level statistics, service delays and outages, and much more.

S3 log files are quite small, uncompressed text files that in case of intensive S3 usage can be generated almost every second:
```
...
2018-09-20 16:20:27     323567 2018-09-20-16-20-26-DE17FAE504462084
2018-09-20 16:20:28     598192 2018-09-20-16-20-27-5F17C98DFA22DA31
2018-09-20 16:20:29     618862 2018-09-20-16-20-28-4660E2CBCB0FB2C5
2018-09-20 16:20:32     381675 2018-09-20-16-20-31-16549B7BABDA06AE
2018-09-20 16:20:33     405131 2018-09-20-16-20-32-14AB46312C254397
2018-09-20 16:20:34     587042 2018-09-20-16-20-33-385E799AFCEBAEE3
2018-09-20 16:20:35     358275 2018-09-20-16-20-34-FA52E601A410E529
2018-09-20 16:20:36     604080 2018-09-20-16-20-35-C02066EDF9026EF9
...
```
So you can have 35K+ files generated per day (and there is no a sub-directory for each day), and if you are going to analyze S3 statistics for a long period of time (weeks, months), the performance of your Hive or Presto queries can be very low.

Additionally there is often a lifecycle rule defined to keep logs only for 1-2 days.
Read More

dmtolpeko