Skip to content
Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • About
  • About
  • Data Skew,  Optimizer,  Spark

    ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Rartitioning and Skew

    November 26, 2023

    Global sorting is one the most important operations on data, and it is not only used to define how you can see the query result in UI but more importantly it is widely used to solve various performance issues in data pipelines i.e. to provide a better data compression, clustering, pruning and so on.

    Let’s see how ORDER BY is implemented in Spark.

    Read More
    dmtolpeko
  • I/O,  JSON,  Spark

    Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues

    October 25, 2023

    Spark offers a very convenient way to read JSON data. But let’s see some performance implications for reading very large JSON files.

    Let’s assume we have a JSON file with records like:

    {"a":1,  "b":3,  "c":7}
    {"a":11, "b":13, "c":17}
    {"a":31, "b":33, "c":37, "d":71}
    
    Read More
    dmtolpeko
  • Optimizer,  Spark

    Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations

    October 15, 2023

    Calculating the number of distinct values is one of the most popular operations in analytics and many queries even contain multiple COUNT DISTINCT expressions on different columns.

    Most people realize that this should be a quite heavy calculation. But how is it really resource consuming and what operations are involved? Are there any bottlenecks? Can it be effectively distributed or just runs on a single node? What optimizations are applied?

    Let’s see how this is implemented in Spark. We will focus on the exact COUNT DISTINCT calculations, so approximate calculations are out of scope in this article.

    Read More
    dmtolpeko
  • I/O,  Spark

    Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith

    October 10, 2023

    Often incoming data contain timestamp values (date and time) in the string representation like
    2023-07-28 12:50:22.087 i.e., and it is common to run queries with DATE filters as follows:

      SELECT *
      FROM incoming_data
      WHERE SUBSTR(created_at, 1, 10) = '2023-10-09';
    

    Let’s see what is wrong with such query filters if you read Parquet data in Spark.

    Read More
    dmtolpeko
  • Spark

    Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

    October 6, 2023

    Sometimes when running a heavy query in Spark you can see that some stages are restarted multiple times and it may be difficult to understand information about stages in Spark UI.

    Read More
    dmtolpeko
  • Spark

    Spark AQE – Stage Numeration, Added Jobs at Runtime, Large Number of Tasks, Pending and Skipped Stages

    October 1, 2023

    With Spark Adaptive Query Execution (AQE) the application view becomes very dynamic in Spark UI and information about jobs and stages may look very confusing: new jobs appear from time to time, the estimated number of tasks is high, there are pending parent stages that suddenly become skipped and so on.

    Read More
    dmtolpeko
  • Spark

    Spark – LIMIT on Large Datasets – CollectLimit, GlobalLimit, LocalLimit, spark.sql.limit.scaleUpFactor

    September 17, 2023

    You use the LIMIT clause to quickly browse and review data samples, so you expect that such queries complete in less than a second. But let’s consider Spark’s LIMIT behaviour on very large data sets and what performance issues you may have.

    Read More
    dmtolpeko
  • I/O,  Parquet,  Spark

    Spark – Number of Tasks Reading Large Number of Small Parquet Files

    July 19, 2023

    Sometimes source data arrives from a streaming application as a large set of small Parquet files that you need to compact for more effective read by analytic applications.

    You can observe that by default the number of tasks to read such Parquet files is larger than expected. Let’s see why.

    Read More
    dmtolpeko
  • I/O,  Spark,  Storage

    Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved

    August 30, 2022

    It is quite typical to write the resulting data into a partitioned table, but performance can be very slow compared with writing the same data volume into a non-partitioned table.

    Let’s investigate why it is slow, and why the sorting operation happens.

    Read More
    dmtolpeko
  • I/O,  Spark,  Storage

    Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

    August 30, 2022

    It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and job is completed in parallel.

    But sometimes it still may be useful when a task generates multiple output files with the limited number of records in each file by using spark.sql.files.maxRecordsPerFile option:

    Read More
    dmtolpeko
 Older Posts

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Rartitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

Archives

  • November 2023 (1)
  • October 2023 (5)
  • September 2023 (1)
  • July 2023 (1)
  • August 2022 (4)
  • April 2022 (1)
  • March 2021 (2)
  • January 2021 (2)
  • June 2020 (4)
  • May 2020 (8)
  • April 2020 (3)
  • February 2020 (3)
  • December 2019 (5)
  • November 2019 (4)
  • October 2019 (1)
  • September 2019 (2)
  • August 2019 (1)
  • May 2019 (9)
  • April 2019 (2)
  • January 2019 (3)
  • December 2018 (4)
  • November 2018 (1)
  • October 2018 (6)
  • September 2018 (2)

Categories

  • Amazon (14)
  • Auto Scaling (1)
  • AWS (28)
  • Cost Optimization (1)
  • CPU (2)
  • Data Skew (2)
  • Distributed (1)
  • EC2 (1)
  • EMR (13)
  • ETL (2)
  • Flink (5)
  • Hadoop (14)
  • Hive (17)
  • Hue (1)
  • I/O (25)
  • JSON (1)
  • JVM (3)
  • Kinesis (1)
  • Logs (1)
  • Memory (7)
  • Monitoring (4)
  • Optimizer (2)
  • ORC (5)
  • Parquet (8)
  • Pig (2)
  • Presto (3)
  • Qubole (2)
  • RDS (1)
  • S3 (18)
  • Snowflake (6)
  • Spark (17)
  • Storage (14)
  • Tez (10)
  • YARN (18)

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org
Savona Theme by Optima Themes