Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • About
  • I/O,  Spark,  Storage

    Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved

    August 30, 2022

    It is quite typical to write the resulting data into a partitioned table, but performance can be very slow compared with writing the same data volume into a non-partitioned table.

    Let’s investigate why it is slow, and why the sorting operation happens.

    dmtolpeko
  • I/O,  Spark,  Storage

    Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

    August 30, 2022

    It is highly recommended to distribute the work evenly among multiple tasks, so that every task produces a single output file and the job completes in parallel.

    But sometimes it is still useful for a task to generate multiple output files, each with a limited number of records, by using the spark.sql.files.maxRecordsPerFile option.
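
As a rough illustration of the effect of this option, the sketch below estimates how many files a single task writes to a non-partitioned output. The helper `files_per_task` is hypothetical and not part of Spark. It assumes that the writer starts a new file each time the current file reaches the limit, and that a zero or negative value means no limit. In a real job the limit can be set for the session with `spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)` or for a single write with `df.write.option("maxRecordsPerFile", 1000000)`.

```python
import math


def files_per_task(records_in_task: int, max_records_per_file: int) -> int:
    """Estimate the number of files one task writes (hypothetical helper).

    Assumes a non-partitioned output and a task with at least one record.
    """
    if records_in_task < 1:
        raise ValueError("expected a task with at least one record")
    # Zero or a negative limit means the option is disabled: one file per task.
    if max_records_per_file <= 0:
        return 1
    # Otherwise the writer rolls over to a new file every time the limit is hit.
    return math.ceil(records_in_task / max_records_per_file)


# A task holding 2.5 million records with a limit of 1 million per file.
print(files_per_task(2_500_000, 1_000_000))  # 3
```

For a partitioned table the same estimate would apply per partition directory within each task, so the total file count can be considerably higher.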

    dmtolpeko
  • Amazon,  AWS,  EMR,  Spark

    EMR Spark – Initial Number of Executors and spark.dynamicAllocation.enabled

    August 29, 2022

    By default, Spark on EMR clusters has spark.dynamicAllocation.enabled set to true, meaning that Spark dynamically scales the number of executors up and down as required.

    But what is the initial number of executors when you start your Spark job?
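
As a starting point for this question, open-source Spark takes the largest of three settings as the initial executor count when dynamic allocation is enabled. The sketch below mirrors that rule in plain Python. The function `initial_executors` and the dictionary-based configuration are hypothetical stand-ins, and the defaults shown are those of open-source Spark. An EMR cluster may ship its own values for these settings, so the result on EMR can differ.

```python
def initial_executors(conf: dict) -> int:
    """Initial executor count under dynamic allocation (hypothetical helper).

    Mirrors the open-source Spark rule: the maximum of minExecutors,
    initialExecutors and spark.executor.instances (--num-executors).
    """
    # minExecutors defaults to 0 in open-source Spark.
    min_executors = int(conf.get("spark.dynamicAllocation.minExecutors", 0))
    # initialExecutors defaults to the value of minExecutors.
    initial = int(conf.get("spark.dynamicAllocation.initialExecutors", min_executors))
    # spark.executor.instances is treated as 0 when it is not set.
    instances = int(conf.get("spark.executor.instances", 0))
    return max(min_executors, initial, instances)


# Example: --num-executors 10 with otherwise default settings.
print(initial_executors({"spark.executor.instances": "10"}))  # 10
```

With no settings at all the rule yields 0, in which case executors are requested only once tasks are pending.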

    dmtolpeko
  • Amazon,  AWS,  EMR,  Spark

    EMR Spark – Much Larger Executors are Created than Requested

    August 26, 2022

    Starting from EMR 5.32 and EMR 6.2, you may notice that Spark can launch much larger executors than you request in your job settings. For example, EMR created my cluster with the following default settings (they depend on the instance type and the maximizeResourceAllocation classification option):

      spark.executor.memory                      18971M
      spark.executor.cores                       4
      spark.yarn.executor.memoryOverheadFactor   0.1875
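
For reference, the settings above imply a particular container size per executor. The sketch below works through the arithmetic, assuming that spark.yarn.executor.memoryOverheadFactor is the fraction of executor memory added as overhead (an assumption based on its name) and that the open-source Spark rule applies: overhead is the larger of that fraction and 384 MB. The helper `container_request_mb` is hypothetical, and YARN may round the request up further to its own allocation increment.

```python
MIN_OVERHEAD_MB = 384  # minimum overhead used by open-source Spark


def container_request_mb(executor_memory_mb: int, overhead_factor: float) -> int:
    """Memory requested per executor container, in MB (hypothetical helper)."""
    # Overhead is a fraction of executor memory, truncated, with a 384 MB floor.
    overhead_mb = max(int(executor_memory_mb * overhead_factor), MIN_OVERHEAD_MB)
    return executor_memory_mb + overhead_mb


# Values taken from the default settings listed above.
total_mb = container_request_mb(18971, 0.1875)
print(total_mb, total_mb / 1024)  # 22528 22.0
```

Under these assumptions the requested executor is a 22 GB container with 4 cores.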
    

    But when I start a Spark session (pyspark command) I see the following:

    dmtolpeko

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Partitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

