I/O – Page 2 – Large-Scale Data Engineering in Cloud

I/O, Parquet, Storage

How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

May 29, 2020

Parquet is one of the most popular columnar file formats, used by many tools including Apache Hive, Spark, Presto and Flink.

To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).


dmtolpeko
AWS, I/O, S3

S3 Multipart Upload – S3 Access Log Messages

April 17, 2020

Most applications writing data into S3 use the S3 multipart upload API to upload data in parts. First you initiate the upload, then you upload the parts, and finally you complete the multipart upload.
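
The three steps can be sketched in Python as follows. This is a minimal illustration, not the code of the application discussed here: the `client` argument is assumed to expose the same `create_multipart_upload`, `upload_part` and `complete_multipart_upload` calls as a boto3 S3 client, while `FakeS3Client`, `split_into_parts`, `multipart_upload` and the bucket name `example-bucket` are hypothetical names introduced only so the sketch runs without AWS credentials. Note that real S3 requires every part except the last to be at least 5 MiB; the tiny part size below is for demonstration only.

```python
from typing import Iterator, List


def split_into_parts(data: bytes, part_size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of at most part_size bytes."""
    for offset in range(0, len(data), part_size):
        yield data[offset:offset + part_size]


def multipart_upload(client, bucket: str, key: str, data: bytes, part_size: int) -> List[dict]:
    """Upload data in parts using the initiate / upload parts / complete sequence."""
    # Step 1: initiate the multipart upload and remember its upload ID.
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    # Step 2: upload the parts; part numbers start at 1.
    parts = []
    for number, chunk in enumerate(split_into_parts(data, part_size), start=1):
        response = client.upload_part(
            Bucket=bucket, Key=key, PartNumber=number, UploadId=upload_id, Body=chunk
        )
        parts.append({"PartNumber": number, "ETag": response["ETag"]})

    # Step 3: complete the upload by listing every part with its ETag.
    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id, MultipartUpload={"Parts": parts}
    )
    return parts


class FakeS3Client:
    """Hypothetical in-memory stand-in for an S3 client (assumption, not a real API)."""

    def __init__(self):
        self.calls = []      # names of the operations, in the order they were issued
        self.objects = {}    # (bucket, key) -> assembled object bytes
        self._pending = {}   # upload ID -> {part number: bytes}

    def create_multipart_upload(self, Bucket, Key):
        upload_id = f"upload-{len(self._pending) + 1}"
        self._pending[upload_id] = {}
        self.calls.append("CreateMultipartUpload")
        return {"UploadId": upload_id}

    def upload_part(self, Bucket, Key, PartNumber, UploadId, Body):
        self._pending[UploadId][PartNumber] = Body
        self.calls.append("UploadPart")
        return {"ETag": f'"etag-{PartNumber}"'}

    def complete_multipart_upload(self, Bucket, Key, UploadId, MultipartUpload):
        stored = self._pending.pop(UploadId)
        ordered = [stored[part["PartNumber"]] for part in MultipartUpload["Parts"]]
        self.objects[(Bucket, Key)] = b"".join(ordered)
        self.calls.append("CompleteMultipartUpload")
        return {}


if __name__ == "__main__":
    client = FakeS3Client()
    multipart_upload(client, "example-bucket", "data.gz", b"abcdefghij", part_size=4)
    print(client.calls)
```

With a real boto3 client the same `multipart_upload` function should work unchanged, provided the part size respects the S3 limits.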

Let’s see how this operation is reflected in the S3 access log. My application uploaded the file data.gz into S3, and I can view it as follows:


dmtolpeko
AWS, Flink, I/O, S3

Flink – Tuning Writes to S3 Sink – fs.s3a.threads.max

April 12, 2020

One of our Flink streaming jobs had significant variance in the time spent on writing files to S3 by the same Task Manager process.

What settings do you need to check first?


dmtolpeko
Amazon, AWS, Hive, I/O, Parquet, S3, Spark

Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms

December 30, 2019

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake and it is quite simple, but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet. The Spark job itself ran for just 38 minutes; the remaining time (well over 3 hours) was spent loading the data into the Hive partition.

Let’s see what causes this behavior and how we can improve the performance.


dmtolpeko
I/O, Snowflake

Snowflake – Micro-Partitions and Clustering Depth

December 2, 2019

Traditional data warehouses require you to explicitly specify partition columns for tables using the PARTITION BY clause. There is no PARTITION BY clause in the CREATE TABLE statement in Snowflake, although Snowflake still relies heavily on partitions.

I have already written about partitions in Snowflake (see MIN/MAX Functions and Partition Pruning in Snowflake), but in this article I am going to investigate some more details.


dmtolpeko
I/O, Snowflake, Storage

Snowflake – Monitoring Data Ingestion using QUERY_HISTORY and COPY_HISTORY – Single Large File vs Multiple Small Files

May 14, 2019

Snowflake provides various options to monitor data ingestion from external storage such as Amazon S3. In this article I am going to review the QUERY_HISTORY and COPY_HISTORY table functions.

COPY commands are widely used to move data into Snowflake at regular time intervals, and we can monitor their execution by querying the query history with the query_type = 'COPY' filter.


dmtolpeko
I/O, Snowflake

Snowflake – Remote Disk I/O, Local Disk Cache – Capacity, Utilization and Transfer Rate

May 4, 2019

Snowflake uses a cloud storage service such as Amazon S3 as permanent storage for data (Remote Disk in Snowflake terms), but it can also use Local Disk (SSD) to temporarily cache data used by SQL queries. Let’s test Remote and Local I/O performance by executing a sample SQL query multiple times on X-Large and Medium size Snowflake warehouses:
```
SELECT MIN(event_hour), MAX(event_hour) FROM events WHERE event_name = 'LOGIN';
```
Note that you should disable the Result Cache for queries in your session when performing such tests; otherwise Snowflake will simply return the cached result for every run after the first one:
```
alter session set USE_CACHED_RESULT = FALSE;
```

dmtolpeko
Amazon, AWS, EMR, Hive, I/O, S3

S3 Writes When Inserting Data into a Hive Table in Amazon EMR

December 4, 2018

Often in an ETL process we move data from one source to another, typically doing some filtering, transformations and aggregations. Let’s consider which write operations are performed in S3.

To focus on S3 writes, I am going to use a very simple SQL INSERT statement that moves data from one table into another without any transformations:
```
INSERT OVERWRITE TABLE events PARTITION (event_dt = '2018-12-02', event_hour = '00')
SELECT
  record_id,
  event_timestamp,
  event_name,
  app_name,
  country,
  city,
  payload
FROM events_raw;
```

dmtolpeko
Hive, I/O, Tez

Tez Internals #1 – Number of Map Tasks

October 22, 2018

Let’s see how Apache Tez defines the number of map tasks when you execute a SQL query in Hive running on the Tez engine.

Consider the following sample SQL query on a partitioned table:
```
select count(*) from events where event_dt = '2018-10-20' and event_name = 'Started';
```
The events table is partitioned by the event_dt and event_name columns, and it is quite big: it has 209,146 partitions, while the query requests data from only one of them:
```
$ hive -e "show partitions events"

event_dt=2017-09-22/event_name=ClientMetrics
event_dt=2017-09-23/event_name=ClientMetrics
...
event_dt=2018-10-20/event_name=Location
...
event_dt=2018-10-20/event_name=Started

Time taken: 26.457 seconds, Fetched: 209146 row(s)
```

dmtolpeko
Amazon, AWS, I/O, Monitoring, S3, Storage

S3 Monitoring #4 – Read Operations and Tables

October 18, 2018

Knowing how Hive table storage is organized can help us extract some additional information for S3 read operations for each table.

In most cases (and you can easily adapt this for your specific table storage pattern), tables are stored in an S3 bucket under the following key structure:
```
s3://<bucket_name>/hive/<database_name>/<table_name>/<partition1>/<partition2>/...
```
For example, hourly data for the orders table can be stored as follows:
```
s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/
```
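
As a rough illustration of how this key structure can be used, the sketch below splits such a path into bucket, database, table and partition values. It assumes exactly the layout shown above (a `hive` prefix followed by database, table and `name=value` partition directories); the `TableLocation` type and the `parse_table_key` function are hypothetical names introduced for this example, not part of any library.

```python
from typing import Dict, NamedTuple
from urllib.parse import urlparse


class TableLocation(NamedTuple):
    bucket: str
    database: str
    table: str
    partitions: Dict[str, str]


def parse_table_key(url: str) -> TableLocation:
    """Split s3://<bucket>/hive/<database>/<table>/<partition>/... into its components."""
    parsed = urlparse(url)
    # Drop empty segments produced by leading, trailing or doubled slashes.
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) < 3 or segments[0] != "hive":
        raise ValueError(f"Unexpected key structure: {url}")

    database, table = segments[1], segments[2]

    # Partition directories look like name=value; stop at the first segment
    # that does not (for example, a data file name).
    partitions: Dict[str, str] = {}
    for segment in segments[3:]:
        name, separator, value = segment.partition("=")
        if not separator:
            break
        partitions[name] = value

    return TableLocation(parsed.netloc, database, table, partitions)


if __name__ == "__main__":
    print(parse_table_key("s3://cloudsqale/hive/sales.db/orders/created_dt=2018-10-18/hour=00/"))
```

The database segment is returned as it appears in the key (including the `.db` suffix); strip it if your reports should show the bare database name.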

dmtolpeko