Spark – Large-Scale Data Engineering in Cloud

Data Skew, Optimizer, Spark

ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Partitioning and Skew

November 26, 2023

Global sorting is one of the most important operations on data. It not only defines how query results are displayed in a UI but, more importantly, is widely used to solve performance problems in data pipelines, for example to improve data compression, clustering, and pruning.

Let’s see how ORDER BY is implemented in Spark.
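
As a rough illustration of the sampling and range partitioning named in the title, here is a minimal pure-Python sketch. It is not Spark's code: the function name `range_partition_sort`, the sample size, and the way boundaries are picked are all assumptions made for the example. It samples the keys, derives range boundaries, routes every row to its range, and sorts each range independently, so the concatenated partitions come out globally sorted.

```python
import bisect
import random


def range_partition_sort(rows, num_partitions, sample_size=100, seed=42):
    """Hypothetical helper: sample, pick range bounds, partition, sort locally."""
    if not rows:
        return [[] for _ in range(num_partitions)]
    rng = random.Random(seed)
    # 1. Sample the keys to estimate their distribution without a full scan.
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    # 2. Pick num_partitions - 1 boundaries at evenly spaced sample positions.
    bounds = [sample[len(sample) * i // num_partitions]
              for i in range(1, num_partitions)]
    # 3. "Shuffle": route each row to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[bisect.bisect_left(bounds, row)].append(row)
    # 4. Sort every partition on its own (the parallel part on a cluster).
    return [sorted(p) for p in partitions]


# Skewed input: the key 5 accounts for more than 70% of all rows.
data = [5] * 700 + list(range(300))
parts = range_partition_sort(data, num_partitions=4)
flat = [x for p in parts for x in p]  # concatenation is globally sorted
sizes = [len(p) for p in parts]       # one partition holds every row with key 5
```

Equal keys always land in the same range, so with this input one partition receives at least 701 of the 1,000 rows no matter where the boundaries fall. That is the skew problem in miniature.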

Read More

dmtolpeko
I/O, JSON, Spark

Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues

October 25, 2023
Spark offers a very convenient way to read JSON data. But let’s see some performance implications for reading very large JSON files.

Let’s assume we have a JSON file with records like:
```
{"a":1,  "b":3,  "c":7}
{"a":11, "b":13, "c":17}
{"a":31, "b":33, "c":37, "d":71}
```
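To show why schema inference depends on how much of the file is read, here is a toy sketch in plain Python. It is not Spark's JSON reader: `infer_schema` is a hypothetical helper, and the "sample" is simply the first two records so the result is deterministic, whereas a real sample would be random.

```python
import json

lines = [
    '{"a":1,  "b":3,  "c":7}',
    '{"a":11, "b":13, "c":17}',
    '{"a":31, "b":33, "c":37, "d":71}',
]


def infer_schema(json_lines):
    """Hypothetical helper: union of field names (with value types) over the records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema


full_schema = infer_schema(lines)         # reads every record, finds a, b, c, d
sampled_schema = infer_schema(lines[:2])  # reads a subset only, never sees "d"
```

The full read costs a complete pass over the data before the query even starts. The subset is cheaper but can miss fields that occur only in some records.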
Read More

dmtolpeko
Optimizer, Spark

Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations

October 15, 2023

Calculating the number of distinct values is one of the most popular operations in analytics and many queries even contain multiple COUNT DISTINCT expressions on different columns.

Most people realize that this is a fairly heavy calculation. But how resource-consuming is it really, and what operations are involved? Are there any bottlenecks? Can it be distributed effectively, or does it run on a single node? What optimizations are applied?

Let’s see how this is implemented in Spark. We will focus on exact COUNT DISTINCT calculations; approximate calculations are out of scope for this article.
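
Before looking at Spark itself, the general pattern for an exact distributed distinct count can be sketched in a few lines of plain Python. This is an assumption-level illustration with a hypothetical `count_distinct` helper, not Spark's physical plan: duplicates are removed inside each input partition first, equal values are then routed to the same reducer by hash, and the per-reducer counts are summed.

```python
def count_distinct(partitions, num_reducers=3):
    """Hypothetical helper: exact COUNT(DISTINCT x) over partitioned input."""
    # 1. Partial aggregation: drop duplicates inside each input partition.
    partial = [set(p) for p in partitions]
    # 2. Shuffle: route each value by hash so equal values meet on one reducer.
    reducers = [set() for _ in range(num_reducers)]
    for values in partial:
        for v in values:
            reducers[hash(v) % num_reducers].add(v)
    # 3. Final aggregation: each reducer counts its values, then the counts are summed.
    return sum(len(r) for r in reducers)


input_partitions = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
total = count_distinct(input_partitions)  # distinct values are 1, 2, 3, 4, 5
```

Step 1 shrinks the data only when a partition contains many repeats. With mostly unique values, nearly every row still has to cross the shuffle.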

Read More

dmtolpeko
I/O, Spark

Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith

October 10, 2023
Incoming data often contains timestamp values (date and time) in a string representation such as 2023-07-28 12:50:22.087, and it is common to run queries with DATE filters as follows:
```
  SELECT *
  FROM incoming_data
  WHERE SUBSTR(created_at, 1, 10) = '2023-10-09';
```
Let’s see what is wrong with such query filters if you read Parquet data in Spark.
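
The excerpt stops before the explanation, so the following is only a sketch of the general idea suggested by the title. It assumes that a prefix predicate (the StringStartsWith form of `LIKE '2023-10-09%'`) can be checked against per-chunk min/max statistics, while a function applied to the column generally cannot. The helper `may_contain_prefix` and the sample statistics are hypothetical.

```python
def may_contain_prefix(min_value, max_value, prefix):
    """True if a chunk with these min/max stats could hold a value starting with prefix."""
    n = len(prefix)
    return min_value[:n] <= prefix <= max_value[:n]


# Hypothetical (min, max) statistics of created_at for three row groups.
row_groups = [
    ("2023-10-08 00:00:01.000", "2023-10-08 23:59:58.120"),
    ("2023-10-09 00:00:00.500", "2023-10-09 23:59:59.999"),
    ("2023-10-10 00:00:02.250", "2023-10-10 23:58:11.004"),
]
to_read = [rg for rg in row_groups if may_contain_prefix(rg[0], rg[1], "2023-10-09")]

# For a 10-character literal, both filter forms select exactly the same rows.
value = "2023-10-09 12:50:22.087"
same_result = (value[:10] == "2023-10-09") == value.startswith("2023-10-09")
```

Only the second row group survives the check here. A filter that wraps the column in SUBSTR offers no such shortcut unless the engine rewrites it, so every row group has to be read and evaluated.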
Read More

dmtolpeko
Spark

Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

October 6, 2023

Sometimes when running a heavy query in Spark, you can see that some stages are restarted multiple times, and the stage information in the Spark UI can be difficult to interpret.

Read More

dmtolpeko
Spark

Spark AQE – Stage Numeration, Added Jobs at Runtime, Large Number of Tasks, Pending and Skipped Stages

October 1, 2023

With Spark Adaptive Query Execution (AQE) the application view becomes very dynamic in Spark UI and information about jobs and stages may look very confusing: new jobs appear from time to time, the estimated number of tasks is high, there are pending parent stages that suddenly become skipped and so on.

Read More

dmtolpeko
Spark

Spark – LIMIT on Large Datasets – CollectLimit, GlobalLimit, LocalLimit, spark.sql.limit.scaleUpFactor

September 17, 2023

You use the LIMIT clause to quickly browse and review data samples, so you expect that such queries complete in less than a second. But let’s consider Spark’s LIMIT behaviour on very large data sets and what performance issues you may have.
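
One way to picture the cost is an incremental scan that starts with a single partition and multiplies the number of partitions to try whenever too few rows were found. The sketch below is a simplified assumption of that strategy in plain Python, not Spark's implementation: `take` is a hypothetical helper and the factor of 4 is only the value chosen for this example.

```python
def take(partitions, limit, scale_up_factor=4):
    """Hypothetical helper: collect up to `limit` rows, scanning in growing batches."""
    rows, scanned, batches = [], 0, []
    parts_to_try = 1  # the first attempt reads a single partition
    while len(rows) < limit and scanned < len(partitions):
        batch = partitions[scanned:scanned + parts_to_try]
        for part in batch:
            rows.extend(part[:limit - len(rows)])
        scanned += len(batch)
        batches.append(len(batch))
        # Still short of rows: try scale_up_factor times the partitions scanned so far.
        parts_to_try = scanned * scale_up_factor
    return rows, batches


# 100 partitions, but only the last one has rows (e.g. after a selective filter).
partitions = [[] for _ in range(99)] + [[1, 2, 3]]
rows, batches = take(partitions, limit=2)
```

In this example the scan needs four rounds of 1, 4, 20 and 75 partitions before it finds the two rows. Each round is a separate pass, so on sparse data a "quick" LIMIT ends up touching the whole dataset.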

Read More

dmtolpeko
I/O, Parquet, Spark

Spark – Number of Tasks Reading Large Number of Small Parquet Files

July 19, 2023

Sometimes source data arrives from a streaming application as a large set of small Parquet files that you need to compact so that analytic applications can read them more efficiently.

You can observe that by default the number of tasks to read such Parquet files is larger than expected. Let’s see why.
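
A likely contributor is the per-file open cost that is added when files are packed into read tasks. The sketch below is an assumption-level model in plain Python, not Spark's source: `plan_read_tasks` is a hypothetical helper. Its defaults of 128 MB and 4 MB are modeled on the settings `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, which the excerpt does not mention. It also assumes no file is large enough to be split.

```python
MB = 1024 * 1024


def plan_read_tasks(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Hypothetical helper: estimate how many read tasks the files are packed into."""
    # Every file is charged an extra open cost on top of its real size.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    tasks, current_size, current_files = 0, 0, 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current task when the next file would not fit.
        if current_files and current_size + size > max_split_bytes:
            tasks += 1
            current_size, current_files = 0, 0
        current_size += size + open_cost_bytes
        current_files += 1
    if current_files:
        tasks += 1
    return tasks


small_files = [100 * 1024] * 10_000               # 10,000 files of 100 KB each
tasks = plan_read_tasks(small_files, parallelism=8)
naive = -(-sum(small_files) // (128 * MB))        # ceiling of raw bytes / 128 MB
```

Under these assumptions each task holds only 32 files, which gives 313 tasks for less than 1 GB of data. Dividing the raw bytes by 128 MB would suggest 8.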

Read More

dmtolpeko
I/O, Spark, Storage

Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved

August 30, 2022

It is quite typical to write the resulting data into a partitioned table, but performance can be very slow compared with writing the same data volume into a non-partitioned table.

Let’s investigate why it is slow, and why the sorting operation happens.

Read More

dmtolpeko
I/O, Spark, Storage

Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

August 30, 2022

It is highly recommended to distribute the work evenly among multiple tasks, so that every task produces a single output file and the job completes in parallel.

But sometimes it is still useful for a task to generate multiple output files with a limited number of records in each file by using the spark.sql.files.maxRecordsPerFile option.
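
The effect of such a record limit on a single task can be pictured with a small plain-Python sketch. It only models the splitting: `write_task_output` is a hypothetical helper, "files" are lists, and treating 0 as "no limit" is an assumption of the example.

```python
def write_task_output(records, max_records_per_file=0):
    """Hypothetical helper: split one task's records into output 'files'."""
    if max_records_per_file <= 0:
        return [list(records)]  # no limit: one file per task
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]


task_records = list(range(10))
one_file = write_task_output(task_records)                             # 1 file, 10 records
files = write_task_output(task_records, max_records_per_file=4)        # 3 files: 4, 4, 2 records
```

The number of tasks does not change. Each task simply rolls over to a new file once the record limit is reached.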

Read More

dmtolpeko