Hive, I/O, Tez

Tez Internals #1 – Number of Map Tasks

October 22, 2018
Let’s see how Apache Tez defines the number of map tasks when you execute a SQL query in Hive running on the Tez engine.

Consider the following sample SQL query on a partitioned table:
```
select count(*) from events where event_dt = '2018-10-20' and event_name = 'Started';
```
The events table is partitioned by event_dt and event_name columns, and it is quite big – it has 209,146 partitions, while the query requests data from a single partition only:
```
$ hive -e "show partitions events"

event_dt=2017-09-22/event_name=ClientMetrics
event_dt=2017-09-23/event_name=ClientMetrics
...
event_dt=2018-10-20/event_name=Location
...
event_dt=2018-10-20/event_name=Started

Time taken: 26.457 seconds, Fetched: 209146 row(s)
```
Read More

dmtolpeko
Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring #4 – Read Operations and Tables

October 18, 2018
Knowing how Hive table storage is organized can help us extract some additional information for S3 read operations for each table.

In most cases (and you can easily adapt this for your specific table storage pattern), tables are stored in a S3 bucket under the following key structure:
```
s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
```
For example, hourly data for orders table can be stored as follows:
```
s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
```
Read More

dmtolpeko

Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring Step #3 – Read Operations and File Types

October 10, 2018

After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. Analyzing the object keys you can easily summarize information about compressed files such as .gz files.

Later I will use the Hive metadata information to define whether files named like 00000_0 are uncompressed text or ORC files.

select type, count(*) keys, count(distinct key) dist_keys, 
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
  sum(total_time_ms/1000) time_spent,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select 
  key,
  case 
    when key like '%.gz' then 'Compressed .gz'
    else 'Other'
  end type,
  bytes_sent,
  total_time_ms
from s3_access_logs 
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type;

Here is my sample output:

type	keys	dist_keys	rate_mb_sec	time_spent	terabytes_read
Compressed .gz	21,535,003	7,411,981	3.8	504,318,631	1,812.8
Other	6,345,354	647,040	18.5	1,465,848	25.9

File Types and Object Size Bins

Now let’s see the distribution of file types for each size bin:

select type, size_type, count(*) keys, count(distinct key) dist_keys, 
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
select 
  key,
  case 
    when key like '%.gz' then 'Compressed .gz'
    else 'Other'
  end type,
  case 
    when total_size <= 1024*1024 then '<= 1 MB'
    when total_size <= 30*1024*1024 then '<= 30 MB'
    when total_size <= 100*1024*1024 then '<= 100 MB'
    else '> 100 MB'
  end size_type,
  bytes_sent,
  total_time_ms
from s3_access_logs 
where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
group by type, size_type;

Sample output:

type	size_type	keys	dist_keys	rate_mb_sec	terabytes_read
Compressed .gz	<= 1 MB	7,759,230	3,579,785	5.2	2.4
Compressed .gz	<= 30 MB	6,927,405	2,456,010	4.6	47.3
Compressed .gz	<= 100 MB	1,136,926	436,463	3.7	71.1
Compressed .gz	> 100 MB	5,711,442	939,723	3.7	1,691.9
Other	<= 1 MB	2,535,108	496,286	3.2	0.2
Other	<= 30 MB	1,152,742	90,472	22.7	1.7
Other	<= 100 MB	150,521	7,119	14.7	1.0
Other	> 100 MB	2,506,983	53,191	19.4	23.0

Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring Step #2 – Read Operations

October 9, 2018

After you get the first impression about your S3 buckets by looking at their size, number of objects and daily growth rate (see Step #1), it is time to investigate I/O operations in detail.

Let’s start with read operations. Now we need to use S3 Access Logs to get the detailed information about all performed S3 REST.GET.OBJECT operations.

Read More

dmtolpeko
Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring Step #1 – Bucket Size and Number of Objects

October 8, 2018
The first step in Amazon S3 monitoring is to check the current state of your S3 buckets and how fast they grow. You can easily get this information from the CloudWatch Management console, running a AWS CLI command or AWS SDK script.

Bucket Size

Here is an example of AWS CLI command to get the size of a bucket for every day within --start-time and --end-time date range:
```
aws cloudwatch get-metric-statistics \
  --metric-name BucketSizeBytes --namespace AWS/S3 \
  --start-time 2018-10-01T00:00:00Z --end-time 2018-10-08T00:00:00Z \
  --statistics Maximum --unit Bytes --region us-east-1 \
  --dimensions Name=BucketName,Value=cloudsqale Name=StorageType,Value=StandardStorage \
  --period 86400 --query 'Datapoints[*].[Timestamp, Maximum]' \
  --output text | sort  | python cloudwatch_s3_metrics.py
```
Read More

dmtolpeko
Hive, I/O, ORC, S3, Storage

Simple Hive Queries with Predicates – Compressed Text vs ORC Files

October 5, 2018
Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

Compressed Text Files

Consider the following sample table:
```
CREATE TABLE clicks
(
   id STRING, 
   name STRING,
   ... 
   referral_id STRING
)
STORED AS TEXTFILE
LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
```
In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.
Read More

dmtolpeko