Skip to content
Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • About
  • About
  • I/O,  JSON,  Spark

    Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues

    October 25, 2023

    Spark offers a very convenient way to read JSON data. But let’s see some performance implications for reading very large JSON files.

    Let’s assume we have a JSON file with records like:

    {"a":1,  "b":3,  "c":7}
    {"a":11, "b":13, "c":17}
    {"a":31, "b":33, "c":37, "d":71}
    
    Read More
    dmtolpeko
  • Optimizer,  Spark

    Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations

    October 15, 2023

    Calculating the number of distinct values is one of the most popular operations in analytics and many queries even contain multiple COUNT DISTINCT expressions on different columns.

    Most people realize that this should be a quite heavy calculation. But how is it really resource consuming and what operations are involved? Are there any bottlenecks? Can it be effectively distributed or just runs on a single node? What optimizations are applied?

    Let’s see how this is implemented in Spark. We will focus on the exact COUNT DISTINCT calculations, so approximate calculations are out of scope in this article.

    Read More
    dmtolpeko
  • I/O,  Spark

    Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith

    October 10, 2023

    Often incoming data contain timestamp values (date and time) in the string representation like
    2023-07-28 12:50:22.087 i.e., and it is common to run queries with DATE filters as follows:

      SELECT *
      FROM incoming_data
      WHERE SUBSTR(created_at, 1, 10) = '2023-10-09';
    

    Let’s see what is wrong with such query filters if you read Parquet data in Spark.

    Read More
    dmtolpeko
  • Spark

    Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

    October 6, 2023

    Sometimes when running a heavy query in Spark you can see that some stages are restarted multiple times and it may be difficult to understand information about stages in Spark UI.

    Read More
    dmtolpeko
  • Spark

    Spark AQE – Stage Numeration, Added Jobs at Runtime, Large Number of Tasks, Pending and Skipped Stages

    October 1, 2023

    With Spark Adaptive Query Execution (AQE) the application view becomes very dynamic in Spark UI and information about jobs and stages may look very confusing: new jobs appear from time to time, the estimated number of tasks is high, there are pending parent stages that suddenly become skipped and so on.

    Read More
    dmtolpeko

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Rartitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

Archives

  • November 2023 (1)
  • October 2023 (5)
  • September 2023 (1)
  • July 2023 (1)
  • August 2022 (4)
  • April 2022 (1)
  • March 2021 (2)
  • January 2021 (2)
  • June 2020 (4)
  • May 2020 (8)
  • April 2020 (3)
  • February 2020 (3)
  • December 2019 (5)
  • November 2019 (4)
  • October 2019 (1)
  • September 2019 (2)
  • August 2019 (1)
  • May 2019 (9)
  • April 2019 (2)
  • January 2019 (3)
  • December 2018 (4)
  • November 2018 (1)
  • October 2018 (6)
  • September 2018 (2)

Categories

  • Amazon (14)
  • Auto Scaling (1)
  • AWS (28)
  • Cost Optimization (1)
  • CPU (2)
  • Data Skew (2)
  • Distributed (1)
  • EC2 (1)
  • EMR (13)
  • ETL (2)
  • Flink (5)
  • Hadoop (14)
  • Hive (17)
  • Hue (1)
  • I/O (25)
  • JSON (1)
  • JVM (3)
  • Kinesis (1)
  • Logs (1)
  • Memory (7)
  • Monitoring (4)
  • Optimizer (2)
  • ORC (5)
  • Parquet (8)
  • Pig (2)
  • Presto (3)
  • Qubole (2)
  • RDS (1)
  • S3 (18)
  • Snowflake (6)
  • Spark (17)
  • Storage (14)
  • Tez (10)
  • YARN (18)

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org
Savona Theme by Optima Themes