March 2021 – Large-Scale Data Engineering in Cloud

I/O, Parquet, Spark

Spark – Reading Parquet – Why the Number of Tasks can be Much Larger than the Number of Row Groups

March 19, 2021

A row group is the smallest unit of work when reading from Parquet: it cannot be split into smaller parts. You would therefore expect Spark to create no more tasks than the total number of row groups in your Parquet data source.

But Spark can still create many more tasks than there are row groups. Let’s see how this is possible.
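As a rough illustration, here is a small pure-Python model of one way this can happen. It assumes that Spark plans tasks from byte ranges of the file (bounded by a setting such as `spark.sql.files.maxPartitionBytes`) rather than from row groups, and that a row group is read by the task whose range contains the row group's midpoint. The file size, the row group layout and the helper names `plan_splits` and `assign_row_groups` are all hypothetical, and real split planning takes more settings into account.

```python
MB = 1024 * 1024

def plan_splits(file_size, max_split_bytes):
    """Cut a file into consecutive byte ranges of at most max_split_bytes."""
    return [(start, min(start + max_split_bytes, file_size))
            for start in range(0, file_size, max_split_bytes)]

def assign_row_groups(splits, row_groups):
    """Map each split index to the row groups whose midpoint falls inside it."""
    assignment = {i: [] for i in range(len(splits))}
    for rg_index, (offset, length) in enumerate(row_groups):
        midpoint = offset + length // 2
        for i, (start, end) in enumerate(splits):
            if start <= midpoint < end:
                assignment[i].append(rg_index)
                break
    return assignment

# Hypothetical 1 GB file holding a single 1000 MB row group.
file_size = 1024 * MB
row_groups = [(4, 1000 * MB)]          # (offset, length) in bytes

splits = plan_splits(file_size, 128 * MB)
assignment = assign_row_groups(splits, row_groups)
busy = [i for i, rgs in assignment.items() if rgs]

print(len(splits), "tasks planned,", len(busy), "of them read a row group")
```

In this model eight tasks are planned for a file with one row group, and seven of them have nothing to read.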


dmtolpeko
I/O, Parquet, Spark

Spark – Reading Parquet – Predicate Pushdown for LIKE Operator – EqualTo, StartsWith and Contains Pushed Filters

March 7, 2021

A Parquet file contains MIN/MAX statistics for every column in every row group, which allow Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let’s see how this works with a LIKE pattern-matching filter.
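To make the idea concrete, here is a minimal pure-Python sketch of how MIN/MAX statistics can rule out row groups for different LIKE patterns. It is a model of the reasoning, not Spark's or Parquet's actual code. The statistics below are invented for illustration, the helpers `may_match` and `row_groups_to_read` are hypothetical, and the sketch ignores escape characters in the pattern.

```python
# Invented (min, max) statistics of a string column for four row groups.
STATS = [
    ("apple", "cherry"),
    ("date", "grape"),
    ("kiwi", "mango"),
    ("nectarine", "plum"),
]

def may_match(pattern, rg_min, rg_max):
    """Return False only when MIN/MAX prove the row group has no match."""
    wildcards = [i for i, ch in enumerate(pattern) if ch in "%_"]
    if not wildcards:
        # LIKE 'abc' behaves like equality: the value must lie in [min, max].
        return rg_min <= pattern <= rg_max
    prefix = pattern[:wildcards[0]]
    if not prefix:
        # LIKE '%abc%': a leading wildcard gives the statistics nothing to use.
        return True
    # LIKE 'abc%': compare the prefix with min and max truncated to its length.
    n = len(prefix)
    return rg_min[:n] <= prefix <= rg_max[:n]

def row_groups_to_read(pattern, stats=STATS):
    """Indexes of row groups that cannot be skipped for this pattern."""
    return [i for i, (lo, hi) in enumerate(stats) if may_match(pattern, lo, hi)]

for pattern in ("lemon", "gr%", "%an%"):
    print(pattern, "->", row_groups_to_read(pattern))
```

With these made-up statistics, the equality-like and prefix patterns each leave a single row group to read, while the pattern with a leading wildcard forces all four to be read.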

For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for the product column:


dmtolpeko