August 2022 – Large-Scale Data Engineering in Cloud

I/O, Spark, Storage

Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved

August 30, 2022

It is quite typical to write the resulting data into a partitioned table, but performance can be very slow compared with writing the same data volume into a non-partitioned table.

Let’s investigate why it is slow, and why the sorting operation happens.

Read More

dmtolpeko
I/O, Spark, Storage

Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

August 30, 2022

It is highly recommended to distribute the work evenly among tasks so that every task produces a single output file and the job completes in parallel.

But sometimes it is still useful to have a task generate multiple output files, each with a limited number of records, by using the spark.sql.files.maxRecordsPerFile option:
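
As a rough illustration of the effect of this option, the sketch below estimates how many files a single task would write into one output directory. It is plain Python arithmetic, not Spark code; the function name `files_per_task` and the record counts are hypothetical, and it assumes the task has at least one record and writes to a single directory. A zero or negative limit is treated as "no limit", matching the documented meaning of the option.

```python
def files_per_task(records_in_task: int, max_records_per_file: int) -> int:
    """Estimate the number of output files one task writes to one directory.

    Assumption: the writer starts a new file each time the current file
    already holds max_records_per_file records (hypothetical helper).
    """
    if records_in_task <= 0:
        raise ValueError("this sketch only covers tasks with at least one record")
    if max_records_per_file <= 0:
        # Zero or negative means the limit is disabled: one file per task.
        return 1
    # Integer ceiling division: a partially filled last file still counts.
    return (records_in_task + max_records_per_file - 1) // max_records_per_file


# Hypothetical numbers: a task with 2,500 records and a limit of 1,000 per file.
print(files_per_task(2500, 1000))
```

With these example numbers the task would produce two full files and one partial file.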

Read More

dmtolpeko
Amazon, AWS, EMR, Spark

EMR Spark – Initial Number of Executors and spark.dynamicAllocation.enabled

August 29, 2022

By default, EMR Spark clusters have spark.dynamicAllocation.enabled set to true, meaning that the cluster dynamically allocates resources to scale the number of executors up and down as required.

But what is the initial number of executors when you start your Spark job?
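
For orientation, open-source Spark documents that the initial executor count under dynamic allocation comes from spark.dynamicAllocation.initialExecutors (which defaults to spark.dynamicAllocation.minExecutors), unless spark.executor.instances is set to a larger value. The sketch below models that rule in plain Python; the function and parameter names are hypothetical, the input values are made up, and it is an assumption that a given EMR release follows the same rule with its own defaults.

```python
from typing import Optional


def resolve_initial_executors(
    min_executors: int = 0,
    dyn_initial_executors: Optional[int] = None,
    executor_instances: int = 0,
) -> int:
    """Model of the documented rule: take the largest of the three settings.

    min_executors         ~ spark.dynamicAllocation.minExecutors
    dyn_initial_executors ~ spark.dynamicAllocation.initialExecutors
    executor_instances    ~ spark.executor.instances (--num-executors)
    """
    if dyn_initial_executors is None:
        # initialExecutors falls back to minExecutors when it is not set.
        dyn_initial_executors = min_executors
    return max(min_executors, dyn_initial_executors, executor_instances)


# Hypothetical settings: minExecutors=1, initialExecutors=3, --num-executors 10.
print(resolve_initial_executors(1, 3, 10))
```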

Read More

dmtolpeko
Amazon, AWS, EMR, Spark

EMR Spark – Much Larger Executors are Created than Requested

August 26, 2022

Starting from EMR 5.32 and EMR 6.2, you may notice that Spark can launch much larger executors than you request in your job settings. For example, EMR created my cluster with the following default settings (they depend on the instance type and the maximizeResourceAllocation classification option):
```
  spark.executor.memory                      18971M
  spark.executor.cores                       4
  spark.yarn.executor.memoryOverheadFactor   0.1875
```
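
As a quick sanity check on these defaults, the memory requested per executor container can be estimated as the executor memory plus an overhead of `memory * factor`, with a 384 MB floor. The sketch below is plain Python under that assumption: the helper name `executor_container_mb` is hypothetical, it assumes EMR applies the overhead factor the same way open-source Spark applies its own overhead factor, and it ignores off-heap and PySpark memory as well as any rounding done by YARN.

```python
MIN_OVERHEAD_MB = 384  # minimum memory overhead used by Spark on YARN


def executor_container_mb(executor_memory_mb: int, overhead_factor: float) -> int:
    """Estimate the container memory (MB) requested for one executor."""
    # Overhead is a fraction of executor memory, truncated to whole MB,
    # but never below the 384 MB minimum.
    overhead_mb = max(int(executor_memory_mb * overhead_factor), MIN_OVERHEAD_MB)
    return executor_memory_mb + overhead_mb


# Values taken from the default settings listed above.
total_mb = executor_container_mb(18971, 0.1875)
print(total_mb, total_mb / 1024)  # 22528 22.0
```

So the requested defaults correspond to roughly a 22 GB container per executor, which is the baseline to compare against what the session actually launches.
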
But when I start a Spark session (pyspark command) I see the following:
Read More

dmtolpeko