Usually in a Data Lake we get source data as compressed JSON payloads (.gz
files). Additionally, the first level of JSON objects is often parsed into a map<string, string>
structure to speed up access to the first-level keys/values; the get_json_object
function can then be used to parse deeper JSON levels whenever required.
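For instance, a query against such a table might look like this (the `events` table and its column names here are illustrative, not from the actual schema):

```sql
-- Hypothetical source table: the first JSON level is pre-parsed
-- into a map<string, string> column named `data`
SELECT
  data['event_name']                         AS event_name,
  -- deeper levels are still raw JSON strings, so parse them on demand
  get_json_object(data['geo'], '$.country')  AS country
FROM events
WHERE data['event_name'] = 'purchase';
```

Accessing `data['event_name']` is a cheap map lookup, while `get_json_object` is only invoked for the nested payloads that actually need it.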
But it still makes sense to convert the data into the ORC format: it lets you distribute processing evenly across smaller chunks, and use indexes to optimize query execution on complementary columns such as event names, geo information, and other system attributes.
In this example we will load the source data, stored in a single 2.5 GB .gz
file, into the following ORC table: