• Hive,  Memory,  Tez,  YARN

    Tez Memory Tuning – Container is Running Beyond Physical Memory Limits – Solving By Reducing Memory Settings

    Can reducing the Tez memory settings help solve memory limit problems? Sometimes this paradoxical approach works.

    One day one of our Hive queries failed with the following error: Container is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 6.0 GB of 20 GB virtual memory used. Killing container.
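
    For context, these are the kinds of Hive on Tez memory settings involved in such errors (the values below are only an illustration, not the ones from this incident):

    -- Size of the YARN container requested for each Tez task, in MB
    -- (this is what corresponds to the 4 GB physical memory limit in the error above):
    set hive.tez.container.size=4096;
    -- JVM heap for the task; usually set below the container size to leave headroom:
    set hive.tez.java.opts=-Xmx3276m;
    -- Sort buffer allocated within the heap; a smaller value reduces memory pressure:
    set tez.runtime.io.sort.mb=1024;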

  • ORC,  Storage

    Storage Tuning for Mapped JSON Conversion to ORC File Format – Java Heap Issues with Dictionary Encoding

    Usually in a Data Lake we get source data as compressed JSON payloads (.gz files). Additionally, the first level of JSON objects is often parsed into a map<string, string> structure to speed up access to the first-level keys and values, and then the get_json_object function can be used to parse deeper JSON levels whenever required.
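
    For example (the table, column and key names here are hypothetical, used only for illustration), a query over such a table can read first-level values directly from the map and call get_json_object only for the nested objects:

    -- 'payload' is assumed to be a map<string, string> column holding the
    -- first-level JSON keys; deeper levels remain raw JSON strings
    select
      payload['event_name']                               as event_name,  -- direct map access
      get_json_object(payload['device'], '$.os.version')  as os_version   -- parse nested JSON on demand
    from events_json
    where payload['event_name'] = 'Started';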

    But it still makes sense to convert the data into the ORC format to distribute data processing evenly across smaller chunks, or to use indexes and optimize query execution for complementary columns such as event names, geo information, and some other system attributes.

    In this example we will load the source data stored in a single 2.5 GB .gz file into the following ORC table:

  • Hive,  ORC,  Storage

    ORC File Format Internals – Creating Large Stripes in Hive Tables

    Usually the source data arrives as compressed text files, and the first step in an ETL process is to convert them to a columnar format for more effective query execution by users.

    Let’s consider a simple example in which we have a single 120 MB source file in .gz format:

    $ aws s3 ls s3://cloudsqale/hive/events.db/events_raw/
    2018-12-16 18:49:45  120574494 data.gz
    

    and want to convert it into a Hive table in the ORC file format with a 256 MB stripe size. Will 120 MB of .gz data be loaded into a single 256 MB stripe? Not so easy.
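
    For reference, an ORC table with a 256 MB stripe size can be declared like this (a minimal sketch; the column list is made up and is not the table used in the post):

    create table events_orc (
      event_id        string,
      event_name      string,
      event_timestamp string,
      payload         string
    )
    stored as orc
    tblproperties ('orc.stripe.size' = '268435456');   -- 256 MB stripes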

  • Amazon,  AWS,  EMR,  Hive,  I/O,  S3

    S3 Writes When Inserting Data into a Hive Table in Amazon EMR

    Often in an ETL process we move data from one source into another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.

    To focus on S3 writes, I am going to use a very simple SQL INSERT statement that just moves data from one table into another without any transformations:

    INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
    SELECT
      record_id,
      event_timestamp,
      event_name,
      app_name,
      country,
      city,
      payload
    FROM events_raw;
    
  • Amazon,  AWS,  EMR,  Hive,  ORC,  Tez

    Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR

    Let’s see how Hive on Tez defines the number of map tasks when the input data is stored in large ORC files that have small stripes.

    Note: All experiments below were executed on Hive 2.1.1 in Amazon EMR. This article does not apply to Qubole running on Amazon AWS, since Qubole uses a different algorithm to define the number of map tasks for ORC files.
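
    One of the settings involved here is the ORC split strategy in Hive (shown only for illustration; the post examines the actual behavior in detail):

    -- ORC split generation strategy:
    --   BI     - one split per file, without reading ORC footers
    --   ETL    - reads ORC footers and can generate splits per stripe
    --   HYBRID - chooses between BI and ETL based on file count and sizes
    set hive.exec.orc.split.strategy=HYBRID;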

  • Hive,  I/O,  Tez

    Tez Internals #1 – Number of Map Tasks

    Let’s see how Apache Tez defines the number of map tasks when you execute a SQL query in Hive running on the Tez engine.

    Consider the following sample SQL query on a partitioned table:

    select count(*) from events where event_dt = '2018-10-20' and event_name = 'Started';
    

    The events table is partitioned by event_dt and event_name columns, and it is quite big – it has 209,146 partitions, while the query requests data from a single partition only:

    $ hive -e "show partitions events"
    
    event_dt=2017-09-22/event_name=ClientMetrics
    event_dt=2017-09-23/event_name=ClientMetrics
    ...
    event_dt=2018-10-20/event_name=Location
    ...
    event_dt=2018-10-20/event_name=Started
    
    Time taken: 26.457 seconds, Fetched: 209146 row(s)
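
    For reference, the number of map tasks also depends on how Tez groups the input splits into tasks, which is controlled by settings like these (the values are illustrative, not necessarily those used in the post):

    -- Bounds for grouping input splits into map tasks:
    set tez.grouping.min-size=52428800;     -- 50 MB
    set tez.grouping.max-size=1073741824;   -- 1 GB
    -- Desired number of task "waves" relative to the available container capacity:
    set tez.grouping.split-waves=1.7;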
    
  • Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

    S3 Monitoring #4 – Read Operations and Tables

    Knowing how Hive table storage is organized can help us extract some additional information about S3 read operations for each table.

    In most cases (and you can easily adapt this for your specific table storage pattern), tables are stored in an S3 bucket under the following key structure:

    s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
    

    For example, hourly data for the orders table can be stored as follows:

    s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
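
    Given this layout, the database and table names can be extracted from the object key in the S3 access logs, for example (a sketch that assumes the s3_access_logs table used in the other steps of this series and the <database>.db naming shown above):

    select database_name, table_name, sum(bytes_sent) as bytes_read
    from (
      select
        -- pull <database> and <table> out of keys like hive/sales.db/orders/...
        regexp_extract(key, 'hive/([^/]+)\\.db/([^/]+)/', 1) as database_name,
        regexp_extract(key, 'hive/([^/]+)\\.db/([^/]+)/', 2) as table_name,
        bytes_sent
      from s3_access_logs
      where operation = 'REST.GET.OBJECT'
    ) t
    group by database_name, table_name;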
    
  • Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

    S3 Monitoring Step #3 – Read Operations and File Types

    After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. By analyzing the object keys, you can easily summarize information about compressed files such as .gz files.

    Later I will use the Hive metadata to determine whether files named like 00000_0 are uncompressed text or ORC files.

    select type, count(*) keys, count(distinct key) dist_keys, 
      sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
      sum(total_time_ms/1000) time_spent,
      sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
    from (
    select 
      key,
      case 
        when key like '%.gz' then 'Compressed .gz'
        else 'Other'
      end type,
      bytes_sent,
      total_time_ms
    from s3_access_logs 
    where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
    group by type;
    

    Here is my sample output:

    type            keys        dist_keys  rate_mb_sec  time_spent (sec)  terabytes_read
    Compressed .gz  21,535,003  7,411,981  3.8          504,318,631       1,812.8
    Other           6,345,354   647,040    18.5         1,465,848         25.9

    File Types and Object Size Bins

    Now let’s see the distribution of file types for each size bin:

    select type, size_type, count(*) keys, count(distinct key) dist_keys, 
      sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
      sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
    from (
    select 
      key,
      case 
        when key like '%.gz' then 'Compressed .gz'
        else 'Other'
      end type,
      case 
        when total_size <= 1024*1024 then '<= 1 MB'
        when total_size <= 30*1024*1024 then '<= 30 MB'
        when total_size <= 100*1024*1024 then '<= 100 MB'
        else '> 100 MB'
      end size_type,
      bytes_sent,
      total_time_ms
    from s3_access_logs 
    where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
    group by type, size_type;
    

    Sample output:

    type            size_type  keys       dist_keys  rate_mb_sec  terabytes_read
    Compressed .gz  <= 1 MB    7,759,230  3,579,785  5.2          2.4
    Compressed .gz  <= 30 MB   6,927,405  2,456,010  4.6          47.3
    Compressed .gz  <= 100 MB  1,136,926  436,463    3.7          71.1
    Compressed .gz  > 100 MB   5,711,442  939,723    3.7          1,691.9
    Other           <= 1 MB    2,535,108  496,286    3.2          0.2
    Other           <= 30 MB   1,152,742  90,472     22.7         1.7
    Other           <= 100 MB  150,521    7,119      14.7         1.0
    Other           > 100 MB   2,506,983  53,191     19.4         23.0

    See also: S3 Monitoring Step #2 – Read Operations.