October 2023 – Large-Scale Data Engineering in Cloud

I/O, JSON, Spark

Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues

October 25, 2023
Spark offers a very convenient way to read JSON data. But let’s look at some performance implications of reading very large JSON files.

Let’s assume we have a JSON file with records like:
```
{"a":1,  "b":3,  "c":7}
{"a":11, "b":13, "c":17}
{"a":31, "b":33, "c":37, "d":71}
```
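Note that only the last record contains the key `d`. The following is a minimal plain-Python sketch (not Spark code and not Spark's implementation) of why discovering every column requires looking at every record, and how looking at only part of the data can miss a column. The helper `infer_columns` and its `limit` parameter are hypothetical names introduced for illustration.

```python
import json

# Sample records copied from the excerpt above; only the last one has key "d".
lines = [
    '{"a":1,  "b":3,  "c":7}',
    '{"a":11, "b":13, "c":17}',
    '{"a":31, "b":33, "c":37, "d":71}',
]


def infer_columns(json_lines, limit=None):
    """Return the sorted union of top-level keys found in the first
    `limit` records (all records when `limit` is None)."""
    columns = set()
    for line in json_lines[:limit]:
        columns.update(json.loads(line).keys())
    return sorted(columns)


full_schema = infer_columns(lines)              # scans every record
sampled_schema = infer_columns(lines, limit=2)  # scans only the first two

print(full_schema)     # ['a', 'b', 'c', 'd']
print(sampled_schema)  # ['a', 'b', 'c']
```

The partial scan is cheaper but drops column `d`, which is the kind of trade-off any sampling-based schema inference has to deal with.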

dmtolpeko
Optimizer, Spark

Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations

October 15, 2023

Calculating the number of distinct values is one of the most popular operations in analytics and many queries even contain multiple COUNT DISTINCT expressions on different columns.

Most people realize that this is a fairly heavy calculation. But how resource-consuming is it really, and what operations are involved? Are there any bottlenecks? Can it be distributed effectively, or does it run on a single node? What optimizations are applied?

Let’s see how this is implemented in Spark. We will focus on the exact COUNT DISTINCT calculations, so approximate calculations are out of scope in this article.
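As a conceptual warm-up, here is a minimal plain-Python sketch of one generic way to distribute an exact distinct count: deduplicate within each partition, route equal values to the same reducer, then sum the per-reducer counts. This is an illustration under simplifying assumptions, not Spark's actual query plan; the function `distributed_count_distinct` and the parameter `num_reducers` are hypothetical.

```python
from collections import defaultdict


def distributed_count_distinct(partitions, num_reducers=2):
    """Exact distinct count over a list of partitions (lists of values)."""
    # Step 1: partial aggregation, remove duplicates inside each partition
    # so that fewer values have to be sent over the network.
    partial = [set(p) for p in partitions]

    # Step 2: "shuffle", send each value to a reducer chosen by its hash,
    # so that equal values from different partitions meet on one reducer.
    reducers = defaultdict(set)
    for values in partial:
        for v in values:
            reducers[hash(v) % num_reducers].add(v)

    # Step 3: each reducer holds a disjoint set of values, so the final
    # answer is simply the sum of the per-reducer distinct counts.
    return sum(len(values) for values in reducers.values())


partitions = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
result = distributed_count_distinct(partitions)
print(result)  # 5
```

Even in this toy version, the shuffle step moves every locally distinct value, which hints at why the operation is expensive on high-cardinality columns.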


dmtolpeko
I/O, Spark

Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith

October 10, 2023
Incoming data often contains timestamp values (date and time) in a string representation such as
2023-07-28 12:50:22.087, and it is common to run queries with DATE filters as follows:
```
  SELECT *
  FROM incoming_data
  WHERE SUBSTR(created_at, 1, 10) = '2023-10-09';
```
Let’s see what is wrong with such query filters if you read Parquet data in Spark.
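For a 10-character date literal, the `SUBSTR` filter above selects the same rows as a prefix match such as `created_at LIKE '2023-10-09%'`. The sketch below checks that equivalence in plain Python on made-up sample values; it only compares the results of the two predicates and says nothing about how Spark plans or pushes down either form.

```python
# Made-up sample values for the created_at column (assumption for illustration).
rows = [
    "2023-10-09 08:15:00.000",
    "2023-10-09 23:59:59.999",
    "2023-10-10 00:00:00.000",
    "2023-07-28 12:50:22.087",
]
day = "2023-10-09"

# SUBSTR(created_at, 1, 10) = '2023-10-09'
# SQL positions are 1-based, so this is the first 10 characters: r[:10].
by_substr = [r for r in rows if r[:10] == day]

# created_at LIKE '2023-10-09%' is a plain prefix test.
by_prefix = [r for r in rows if r.startswith(day)]

print(by_substr == by_prefix)  # True
print(len(by_prefix))          # 2
```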

dmtolpeko
Spark

Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

October 6, 2023

Sometimes when running a heavy query in Spark you can see that some stages are restarted multiple times, and it can be difficult to interpret the information about these stages in the Spark UI.


dmtolpeko
Spark

Spark AQE – Stage Numeration, Added Jobs at Runtime, Large Number of Tasks, Pending and Skipped Stages

October 1, 2023

With Spark Adaptive Query Execution (AQE), the application view in the Spark UI becomes very dynamic, and information about jobs and stages may look very confusing: new jobs appear from time to time, the estimated number of tasks is high, pending parent stages suddenly become skipped, and so on.


dmtolpeko