July 2023 – Large-Scale Data Engineering in Cloud

I/O, Parquet, Spark

Spark – Number of Tasks Reading Large Number of Small Parquet Files

July 19, 2023

Source data sometimes arrives from a streaming application as a large set of small Parquet files, which you need to compact so that analytic applications can read them more efficiently.

By default, Spark often uses more tasks than you would expect to read such Parquet files. Let’s see why.
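One likely contributor is the way Spark packs files into read partitions. The sketch below is a simplified model, written from my understanding of Spark 3.x file-source planning, not code taken from Spark itself. It assumes the documented defaults for `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), and it assumes every file is smaller than the split size so no file is split. The function names and the `min_partition_num` parameter (standing in for the default parallelism) are hypothetical.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, min_partition_num=8):
    """Approximate the target split size Spark picks for a file scan."""
    # Every file is padded with the open cost before sizes are summed.
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def estimate_read_tasks(file_sizes, max_partition_bytes=128 * MB,
                        open_cost=4 * MB, min_partition_num=8):
    """Estimate the number of read tasks for a set of small files."""
    split = max_split_bytes(file_sizes, max_partition_bytes,
                            open_cost, min_partition_num)
    partitions = 0
    current_size = 0
    files_in_current = 0
    # Files are packed largest first into the current partition.
    for size in sorted(file_sizes, reverse=True):
        if current_size + size > split:
            # Close the current partition if it holds any files.
            if files_in_current > 0:
                partitions += 1
            current_size = 0
            files_in_current = 0
        # The open cost counts against the partition for every file added.
        current_size += size + open_cost
        files_in_current += 1
    if files_in_current > 0:
        partitions += 1
    return partitions


if __name__ == "__main__":
    # Hypothetical input: 1,000 Parquet files of 1 MB each.
    small_files = [1 * MB] * 1000
    print(estimate_read_tasks(small_files))
```

Under these assumptions, each 1 MB file counts as 5 MB toward the 128 MB split, so only 26 files fit in a partition and the model yields 39 tasks. Dividing the 1,000 MB of actual data by 128 MB would suggest about 8.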


dmtolpeko