Parquet – Large-Scale Data Engineering in Cloud

I/O, Parquet, Spark

Spark – Number of Tasks Reading Large Number of Small Parquet Files

July 19, 2023

Sometimes source data arrives from a streaming application as a large set of small Parquet files that you need to compact so that analytic applications can read them more efficiently.

You may notice that, by default, Spark creates more tasks to read such Parquet files than you would expect. Let’s see why.
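
One factor worth checking, stated here as an assumption rather than as the post's conclusion, is how Spark packs files into read partitions. The sketch below is a simplified pure-Python model of that packing: every file is charged an extra spark.sql.files.openCostInBytes (4 MB by default) on top of its size, and a partition is closed once the next file would push it past spark.sql.files.maxPartitionBytes (128 MB by default). The function names and the workload of 1,000 files of 1 MB each are hypothetical.

```python
MB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Target split size; each file is padded with the open cost."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def count_read_tasks(file_sizes, default_parallelism,
                     max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Count read partitions (one task each) under the simplified model."""
    split = max_split_bytes(file_sizes, default_parallelism,
                            max_partition_bytes, open_cost)
    # Files larger than the split size are cut into chunks first.
    chunks = []
    for size in file_sizes:
        for offset in range(0, size, split):
            chunks.append(min(split, size - offset))

    partitions = 0
    current_size = 0
    files_in_current = 0
    for chunk in sorted(chunks, reverse=True):
        # Close the current partition if this chunk does not fit.
        if current_size + chunk > split and files_in_current > 0:
            partitions += 1
            current_size = 0
            files_in_current = 0
        # The open cost is added for every file placed in a partition.
        current_size += chunk + open_cost
        files_in_current += 1
    if files_in_current > 0:
        partitions += 1
    return partitions

small_files = [1 * MB] * 1000                        # hypothetical workload
tasks = count_read_tasks(small_files, default_parallelism=8)
by_data_size = -(-sum(small_files) // (128 * MB))    # ceiling division
print(tasks, by_data_size)                           # prints "39 8"
```

In this model 1,000 MB of data would fit into 8 partitions by size alone, but the per-file open cost limits each partition to 26 files, which gives 39 tasks.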

dmtolpeko
I/O, Parquet, Spark

Spark – Reading Parquet – Why the Number of Tasks can be Much Larger than the Number of Row Groups

March 19, 2021

A row group is a unit of work for reading from Parquet that cannot be split into smaller parts, so you might expect Spark to create no more tasks than the total number of row groups in your Parquet data source.

But Spark can still create many more tasks than there are row groups. Let’s see how this is possible.
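
One way this can happen, sketched here under simplifying assumptions: Spark plans tasks from byte ranges of a file (bounded by spark.sql.files.maxPartitionBytes), not from row groups, and a row group is read only by the task whose byte range contains the row group's midpoint. The helper names and the file layout below are hypothetical.

```python
MB = 1024 * 1024

def plan_splits(file_size, split_size=128 * MB):
    """Cut a file into byte ranges [start, end); one task per range."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]

def row_groups_per_split(row_groups, splits):
    """Assign each row group (start_offset, length) to the split
    that contains its midpoint."""
    assignment = [[] for _ in splits]
    for index, (start, length) in enumerate(row_groups):
        midpoint = start + length // 2
        for split_no, (lo, hi) in enumerate(splits):
            if lo <= midpoint < hi:
                assignment[split_no].append(index)
                break
    return assignment

# Hypothetical 1 GB file that holds a single huge row group.
file_size = 1024 * MB
row_groups = [(4, file_size - 4 - 1 * MB)]

splits = plan_splits(file_size)
assignment = row_groups_per_split(row_groups, splits)
tasks_with_data = sum(1 for groups in assignment if groups)
print(len(splits), tasks_with_data)   # prints "8 1"
```

In this example 8 tasks are created for a single row group, and 7 of them find no row group to read.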

dmtolpeko
I/O, Parquet, Spark

Spark – Reading Parquet – Predicate Pushdown for LIKE Operator – EqualTo, StartsWith and Contains Pushed Filters

March 7, 2021

A Parquet file contains MIN/MAX statistics for every column in every row group, which allows Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let’s see how this works with a LIKE pattern-matching filter.
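
The idea of skipping row groups by MIN/MAX can be sketched in plain Python. This is a simplified model, not Spark's or Parquet's actual implementation: the function names are hypothetical, the statistics are made up for illustration, and plain string comparison is assumed to match Parquet's ordering (true for ASCII values). The three checks correspond to the EqualTo, StartsWith and Contains filters.

```python
def may_match_equal(min_value, max_value, value):
    """EqualTo: the value must lie inside [min, max]."""
    return min_value <= value <= max_value

def may_match_starts_with(min_value, max_value, prefix):
    """StartsWith: compare the prefix with min and max truncated
    to the prefix length."""
    n = len(prefix)
    return min_value[:n] <= prefix <= max_value[:n]

def may_match_contains(min_value, max_value, fragment):
    """Contains: MIN/MAX say nothing about substrings, so no row
    group can be skipped on statistics alone."""
    return True

def row_groups_to_read(stats, check, argument):
    """Indexes of row groups that cannot be skipped."""
    return [i for i, (lo, hi) in enumerate(stats) if check(lo, hi, argument)]

# Hypothetical (min, max) statistics for three row groups.
stats = [("apple", "banana"), ("cherry", "grape"), ("kiwi", "mango")]

print(row_groups_to_read(stats, may_match_equal, "lemon"))      # [2]
print(row_groups_to_read(stats, may_match_starts_with, "ch"))   # [1]
print(row_groups_to_read(stats, may_match_contains, "an"))      # [0, 1, 2]
```

A row group that passes the check may still contain no matching rows; the statistics can only prove that a row group is safe to skip.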

For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for the product column:

dmtolpeko
Parquet

Parquet 1.x File Format – Footer Content

January 15, 2021

Every Parquet file has a footer that contains metadata: schema, row groups and column statistics. The footer is located at the end of the file.

A Parquet file starts and ends with the 4-byte PAR1 “magic” string. Right before the trailing PAR1 there is a 4-byte footer length (little-endian encoding):

The position of the footer can be easily calculated as: File_length - Footer_length - 8 (4 bytes for the footer length field plus 4 bytes for the trailing PAR1)
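
As a minimal sketch of this calculation: the last 8 bytes of the file are the 4-byte little-endian footer length followed by PAR1, so the footer starts Footer_length + 8 bytes before the end of the file. The function name and the fake file content below are hypothetical, and the footer bytes are not real Thrift metadata.

```python
import io
import struct

MAGIC = b"PAR1"

def locate_footer(stream):
    """Return (footer_start, footer_length) for a seekable binary stream."""
    stream.seek(0, io.SEEK_END)
    file_length = stream.tell()

    # Last 8 bytes: 4-byte little-endian footer length + 4-byte magic.
    stream.seek(file_length - 8)
    tail = stream.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("trailing PAR1 magic is missing")

    (footer_length,) = struct.unpack("<I", tail[:4])
    footer_start = file_length - footer_length - 8
    return footer_start, footer_length

# Fake file with the same outer layout as a Parquet 1.x file.
footer = b"fake-thrift-metadata"
fake_file = (MAGIC + b"column-chunks" + footer
             + struct.pack("<I", len(footer)) + MAGIC)

start, length = locate_footer(io.BytesIO(fake_file))
print(fake_file[start:start + length])   # b'fake-thrift-metadata'
```

With a real file you would pass an object opened with open(path, "rb") and hand the returned byte range to a Thrift decoder.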

dmtolpeko
I/O, Parquet, Storage

How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

June 18, 2020

It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

When JSON data has an arbitrary schema, i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

How is it stored? What read performance can you expect? Will json_map["key"] read only the data for that key or the entire JSON?
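
To make the question concrete, here is a simplified model of a map column, assuming the standard Parquet MAP layout in which all keys go to one leaf column and all values to another. Real Parquet tracks row boundaries with repetition and definition levels; the offsets list, the function names and the sample records below are hypothetical stand-ins.

```python
def to_map_columns(records):
    """Flatten a list of dicts into a keys column, a values column
    and row offsets (a stand-in for Parquet's repetition levels)."""
    keys, values, offsets = [], [], [0]
    for record in records:
        for key, value in record.items():
            keys.append(key)
            values.append(value)
        offsets.append(len(keys))
    return keys, values, offsets

def lookup(keys, values, offsets, wanted):
    """Emulate json_map[wanted] for every row; also count how many
    map entries had to be scanned."""
    result, scanned = [], 0
    for row in range(len(offsets) - 1):
        found = None
        for i in range(offsets[row], offsets[row + 1]):
            scanned += 1
            if keys[i] == wanted:
                found = values[i]
        result.append(found)
    return result, scanned

records = [{"a": "1", "b": "2"}, {"b": "3"}, {"a": "4", "c": "5", "d": "6"}]
keys, values, offsets = to_map_columns(records)
print(lookup(keys, values, offsets, "a"))   # (['1', None, '4'], 6)
```

In this model a lookup of one key still walks every entry of every row, which suggests that the reader cannot fetch the data for a single key in isolation.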

dmtolpeko
Flink, I/O, Parquet, S3

Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

June 9, 2020

It is quite common to have a streaming Flink application that reads incoming data and writes it to Parquet files with low latency (a couple of minutes), so that analysts can run both near-realtime and historical ad-hoc analysis, mostly using SQL queries.

But let’s review the write patterns and the problems that can appear for such applications at scale.

dmtolpeko
I/O, Parquet, Storage

How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

May 29, 2020

Parquet is one of the most popular columnar file formats, used by many tools including Apache Hive, Spark, Presto and Flink.

To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).
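
As a rough mental model, and only a sketch under simplifying assumptions, the writer buffers rows in memory and flushes them as one row group once the buffered size reaches the configured row group size (parquet.block.size, 128 MB by default). The real writer measures encoded column data and checks the size only periodically; the function name and the row sizes below are hypothetical.

```python
MB = 1024 * 1024

def simulate_writer(row_sizes, row_group_size=128 * MB):
    """Return the number of rows in each flushed row group."""
    row_groups = []
    buffered_bytes = 0
    buffered_rows = 0
    for size in row_sizes:
        buffered_bytes += size
        buffered_rows += 1
        # Flush the in-memory buffer as one row group when it is full.
        if buffered_bytes >= row_group_size:
            row_groups.append(buffered_rows)
            buffered_bytes = 0
            buffered_rows = 0
    # Closing the file flushes whatever is still buffered.
    if buffered_rows > 0:
        row_groups.append(buffered_rows)
    return row_groups

# Hypothetical input: 300,000 rows of 1 KB each.
print(simulate_writer([1024] * 300_000))   # [131072, 131072, 37856]
```

Under this model each open file needs up to one row group of memory, which matters when a task writes to many partitions at once.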

dmtolpeko
Amazon, AWS, Hive, I/O, Parquet, S3, Spark

Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms

December 30, 2019

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads it into a daily partition of a Hive table. This is a typical data lake job and quite simple, but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet. The actual Spark job took just 38 minutes to run, and the remaining time was spent loading data into the Hive partition.

Let’s see what causes this behavior and how we can improve the performance.

dmtolpeko