When you run a job in Hadoop, you may notice the following error: Application with id 'application_1545962730597_2614' doesn't exist in RM. Later, looking at the YARN Resource Manager UI at http://<RM_IP_Address>:8088/cluster/apps, you can see low Application ID numbers.
-
S3 Writes When Inserting Data into a Hive Table in Amazon EMR
Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.
To focus on S3 writes, I am going to use a very simple SQL INSERT statement that just moves data from one table into another without any transformations:
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT record_id, event_timestamp, event_name, app_name, country, city, payload
FROM events_raw;
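To see which S3 operations such an INSERT actually generates, one option is to group the S3 access log records for the target table location by operation. This is only a sketch: the s3_access_logs table is the one used in the S3 monitoring steps below, and the key prefix is an assumed S3 location of the events table partition.

-- A sketch: s3_access_logs and its columns are the ones used in the
-- monitoring steps below; the key prefix is a hypothetical location of
-- the partition written by the INSERT above.
select operation, count(*) requests, sum(total_size) object_bytes
from s3_access_logs
where event_dt = '2018-12-02'
  and key like 'hive/events.db/events/event_dt=2018-12-02/event_hour=00/%'
group by operation
order by requests desc;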
-
Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR
Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files that have small stripes.
Note. All experiments below were executed on Amazon Hive 2.1.1. This article does not apply to Qubole running on Amazon AWS. Qubole has a different algorithm to define the number of map tasks for ORC files.
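For reference, the following session-level settings commonly influence ORC split generation and therefore the number of map tasks in this scenario. This is only an illustration; the exact behavior depends on the Hive and Tez versions, and the values are not recommendations.

-- Illustrative settings only; defaults and semantics vary by Hive/Tez version.
-- With the ETL strategy Hive reads ORC file footers and can create splits at
-- stripe boundaries, while Tez grouping then merges splits into fewer map
-- tasks within the min/max group size limits below.
set hive.exec.orc.split.strategy=ETL;
set tez.grouping.min-size=268435456;    -- 256 MB
set tez.grouping.max-size=1073741824;   -- 1 GB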
-
S3 Monitoring #4 – Read Operations and Tables
Knowing how Hive table storage is organized can help us extract additional information about S3 read operations for each table.
In most cases (and you can easily adapt this for your specific table storage pattern), tables are stored in an S3 bucket under the following key structure:
s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
For example, hourly data for the orders table can be stored as follows:

s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
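With this layout, read statistics can be aggregated per table by parsing the object key. The query below is only a sketch: it assumes the hive/<database_name>/<table_name>/... layout shown above (S3 access log keys do not include the bucket name) and the s3_access_logs table used in Steps #2 and #3.

-- A sketch assuming the key layout hive/<database_name>/<table_name>/...
select
  split(key, '/')[1] database_name,
  split(key, '/')[2] table_name,
  count(*) keys,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read,
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec
from s3_access_logs
where event_dt = '{$EVENT_DT}'
  and operation = 'REST.GET.OBJECT'
  and key like 'hive/%'
group by split(key, '/')[1], split(key, '/')[2];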
-
S3 Monitoring Step #3 – Read Operations and File Types
After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. By analyzing the object keys you can easily summarize information about compressed files such as .gz files. Later I will use the Hive metadata to determine whether files named like 00000_0 are uncompressed text or ORC files.

select
  type,
  count(*) keys,
  count(distinct key) dist_keys,
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec,
  sum(total_time_ms/1000) time_spent,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
  select
    key,
    case when key like '%.gz' then 'Compressed .gz' else 'Other' end type,
    bytes_sent,
    total_time_ms
  from s3_access_logs
  where event_dt = '{$EVENT_DT}' and operation = 'REST.GET.OBJECT') t
group by type;
Here is my sample output:
type            keys        dist_keys  rate_mb_sec  time_spent   terabytes_read
Compressed .gz  21,535,003  7,411,981  3.8          504,318,631  1,812.8
Other           6,345,354   647,040    18.5         1,465,848    25.9

File Types and Object Size Bins
Now let’s see the distribution of file types for each size bin:
select
  type,
  size_type,
  count(*) keys,
  count(distinct key) dist_keys,
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
from (
  select
    key,
    case when key like '%.gz' then 'Compressed .gz' else 'Other' end type,
    case
      when total_size <= 1024*1024 then '<= 1 MB'
      when total_size <= 30*1024*1024 then '<= 30 MB'
      when total_size <= 100*1024*1024 then '<= 100 MB'
      else '> 100 MB'
    end size_type,
    bytes_sent,
    total_time_ms
  from s3_access_logs
  where event_dt = '{$EVENT_DT}' and operation = 'REST.GET.OBJECT') t
group by type, size_type;
Sample output:
type            size_type  keys       dist_keys  rate_mb_sec  terabytes_read
Compressed .gz  <= 1 MB    7,759,230  3,579,785  5.2          2.4
Compressed .gz  <= 30 MB   6,927,405  2,456,010  4.6          47.3
Compressed .gz  <= 100 MB  1,136,926  436,463    3.7          71.1
Compressed .gz  > 100 MB   5,711,442  939,723    3.7          1,691.9
Other           <= 1 MB    2,535,108  496,286    3.2          0.2
Other           <= 30 MB   1,152,742  90,472     22.7         1.7
Other           <= 100 MB  150,521    7,119      14.7         1.0
Other           > 100 MB   2,506,983  53,191     19.4         23.0

See also S3 Monitoring Step #2 – Read Operations.
-
S3 Monitoring Step #2 – Read Operations
After you get a first impression of your S3 buckets by looking at their size, number of objects and daily growth rate (see Step #1), it is time to investigate I/O operations in detail.
Let’s start with read operations. We need to use S3 Access Logs to get detailed information about all performed S3 REST.GET.OBJECT operations.
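As a starting point, overall read statistics for a day can be summarized with a query like the following. This is a sketch that assumes the s3_access_logs table and columns used in Step #3:

-- A minimal sketch assuming the s3_access_logs table used in Step #3.
select
  count(*) keys,
  count(distinct key) dist_keys,
  sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read,
  sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec,
  sum(total_time_ms/1000) time_spent
from s3_access_logs
where event_dt = '{$EVENT_DT}'
  and operation = 'REST.GET.OBJECT';

-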
S3 Monitoring Step #1 – Bucket Size and Number of Objects
The first step in Amazon S3 monitoring is to check the current state of your S3 buckets and how fast they grow. You can easily get this information from the CloudWatch Management Console, by running an AWS CLI command, or with an AWS SDK script.
Bucket Size
Here is an example of an AWS CLI command to get the size of a bucket for every day within the --start-time and --end-time date range:

aws cloudwatch get-metric-statistics \
  --metric-name BucketSizeBytes --namespace AWS/S3 \
  --start-time 2018-10-01T00:00:00Z --end-time 2018-10-08T00:00:00Z \
  --statistics Maximum --unit Bytes --region us-east-1 \
  --dimensions Name=BucketName,Value=cloudsqale Name=StorageType,Value=StandardStorage \
  --period 86400 --query 'Datapoints[*].[Timestamp, Maximum]' \
  --output text | sort | python cloudwatch_s3_metrics.py
-
Collecting S3 Access Logs
Amazon allows you to enable S3 access logging, which you can use to monitor S3 performance: request rate, I/O workload, user- and compute-node-level statistics, service delays and outages, and much more.
S3 log files are quite small uncompressed text files that, in the case of intensive S3 usage, can be generated almost every second:
...
2018-09-20 16:20:27     323567   2018-09-20-16-20-26-DE17FAE504462084
2018-09-20 16:20:28     598192   2018-09-20-16-20-27-5F17C98DFA22DA31
2018-09-20 16:20:29     618862   2018-09-20-16-20-28-4660E2CBCB0FB2C5
2018-09-20 16:20:32     381675   2018-09-20-16-20-31-16549B7BABDA06AE
2018-09-20 16:20:33     405131   2018-09-20-16-20-32-14AB46312C254397
2018-09-20 16:20:34     587042   2018-09-20-16-20-33-385E799AFCEBAEE3
2018-09-20 16:20:35     358275   2018-09-20-16-20-34-FA52E601A410E529
2018-09-20 16:20:36     604080   2018-09-20-16-20-35-C02066EDF9026EF9
...
So you can have 35K+ files generated per day (and there is no sub-directory for each day), and if you are going to analyze S3 statistics for a long period of time (weeks or months), the performance of your Hive or Presto queries can be very low.
Additionally, there is often a lifecycle rule defined to keep logs for only 1-2 days.
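One common way to deal with both issues is to periodically load the raw log files into your own partitioned table. The statement below is only a sketch: s3_access_logs_raw is an assumed external table mapped over the raw log prefix, request_dt is an assumed column holding the request date, and s3_access_logs with its event_dt partition is the table queried in the monitoring steps above.

-- A sketch with assumed table and column names: compact thousands of small
-- raw log files into one daily partition so later Hive or Presto queries
-- scan far fewer S3 objects and do not depend on the short lifecycle rule.
insert overwrite table s3_access_logs partition (event_dt = '2018-09-20')
select key, operation, bytes_sent, total_size, total_time_ms
from s3_access_logs_raw
where request_dt = '2018-09-20';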