EMR – Page 2 – Large-Scale Data Engineering in Cloud

Amazon, AWS, EMR, YARN

YARN Resource Manager Silent Restarts – Java Heap Space Error – Amazon EMR

January 4, 2019

When you run a job in Hadoop, you may notice the following error: Application with id 'application_1545962730597_2614' doesn't exist in RM. Later, looking at the YARN Resource Manager UI at http://<RM_IP_Address>:8088/cluster/apps, you can see that the application IDs have low numbers.
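The link between the error and the low ID numbers can be illustrated with a small sketch. It assumes the usual YARN ID layout `application_<cluster timestamp>_<sequence number>`, where the cluster timestamp is the Resource Manager start time in milliseconds and the sequence number starts over after a restart. The helper names and the second application ID below are hypothetical and used only for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed layout: application_<RM start time in ms>_<sequence number>
APP_ID_PATTERN = re.compile(r"^application_(\d+)_(\d+)$")


def parse_app_id(app_id):
    """Split an application ID into (cluster_timestamp_ms, sequence_number)."""
    match = APP_ID_PATTERN.match(app_id)
    if match is None:
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return int(match.group(1)), int(match.group(2))


def rm_start_time(app_id):
    """Return the RM start time encoded in the ID as a UTC datetime."""
    cluster_ts_ms, _ = parse_app_id(app_id)
    return datetime.fromtimestamp(cluster_ts_ms / 1000, tz=timezone.utc)


def submitted_before_restart(missing_app_id, current_app_id):
    """True if the missing ID carries an older cluster timestamp than an ID
    the RM lists now, which suggests the RM restarted in between."""
    return parse_app_id(missing_app_id)[0] < parse_app_id(current_app_id)[0]


missing = "application_1545962730597_2614"   # ID from the error message
current = "application_1546000000000_0003"   # hypothetical ID seen in the RM UI

print(rm_start_time(missing).isoformat())
print(submitted_before_restart(missing, current))
```

If the two IDs carry different cluster timestamps and the newer one has a small sequence number, a Resource Manager restart is a likely explanation for the "doesn't exist in RM" error.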


dmtolpeko
Amazon, AWS, EMR, Hive, I/O, S3

S3 Writes When Inserting Data into a Hive Table in Amazon EMR

December 4, 2018

In an ETL process we often move data from one source to another, typically applying some filtering, transformations and aggregations along the way. Let’s look at which write operations are performed in S3 when we do this.

To focus on the S3 writes, I am going to use a very simple SQL INSERT statement that just moves data from one table into another without any transformations:
```
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT
  record_id,
  event_timestamp,
  event_name,
  app_name,
  country,
  city,
  payload
FROM events_raw;
```

dmtolpeko
Amazon, AWS, EMR, Hive, ORC, Tez

Tez Internals #2 – Number of Map Tasks for Large ORC Files with Small Stripes in Amazon EMR

November 12, 2018

Let’s see how Hive on Tez determines the number of map tasks when the input data is stored in large ORC files that have small stripes.

Note. All experiments below were executed on Amazon Hive 2.1.1. This article does not apply to Qubole running on AWS, which uses a different algorithm to determine the number of map tasks for ORC files.


dmtolpeko