Skip to content
Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • About
  • About
  • I/O,  JSON,  Spark

    Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues

    October 25, 2023

    Spark offers a very convenient way to read JSON data. But let’s see some performance implications for reading very large JSON files.

    Let’s assume we have a JSON file with records like:

    {"a":1,  "b":3,  "c":7}
    {"a":11, "b":13, "c":17}
    {"a":31, "b":33, "c":37, "d":71}
    
    Read More
    dmtolpeko
  • I/O,  Spark

    Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith

    October 10, 2023

    Often incoming data contain timestamp values (date and time) in the string representation like
    2023-07-28 12:50:22.087 i.e., and it is common to run queries with DATE filters as follows:

      SELECT *
      FROM incoming_data
      WHERE SUBSTR(created_at, 1, 10) = '2023-10-09';
    

    Let’s see what is wrong with such query filters if you read Parquet data in Spark.

    Read More
    dmtolpeko
  • I/O,  Parquet,  Spark

    Spark – Number of Tasks Reading Large Number of Small Parquet Files

    July 19, 2023

    Sometimes source data arrives from a streaming application as a large set of small Parquet files that you need to compact for more effective read by analytic applications.

    You can observe that by default the number of tasks to read such Parquet files is larger than expected. Let’s see why.

    Read More
    dmtolpeko
  • I/O,  Spark,  Storage

    Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved

    August 30, 2022

    It is quite typical to write the resulting data into a partitioned table, but performance can be very slow compared with writing the same data volume into a non-partitioned table.

    Let’s investigate why it is slow, and why the sorting operation happens.

    Read More
    dmtolpeko
  • I/O,  Spark,  Storage

    Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

    August 30, 2022

    It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and job is completed in parallel.

    But sometimes it still may be useful when a task generates multiple output files with the limited number of records in each file by using spark.sql.files.maxRecordsPerFile option:

    Read More
    dmtolpeko
  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Why the Number of Tasks can be Much Larger than the Number of Row Groups

    March 19, 2021

    A row group is a unit of work for reading from Parquet that cannot be split into smaller parts, and you expect that the number of tasks created by Spark is no more than the total number of row groups in your Parquet data source.

    But Spark still can create much more tasks than the number of row groups. Let’s see how this is possible.

    Read More
    dmtolpeko
  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Predicate Pushdown for LIKE Operator – EqualTo, StartsWith and Contains Pushed Filters

    March 7, 2021

    A Parquet file contains MIN/MAX statistics for every column for every row group that allows Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let’s see how this works with LIKE pattern matching filter.

    For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for product column:

    Read More
    dmtolpeko
  • I/O,  Parquet,  Storage

    How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

    June 18, 2020

    It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

    When JSON data has an arbitrary schema i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

    How is it stored? What read performance can you expect? Will json_map["key"] read only data for key or the entire JSON?

    Read More
    dmtolpeko
  • Flink,  I/O,  Parquet,  S3

    Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

    June 9, 2020

    It is quite common to have a streaming Flink application that reads incoming data and puts them into Parquet files with low latency (a couple of minutes) for analysts to be able to run both near-realtime and historical ad-hoc analysis mostly using SQL queries.

    But let’s review write patterns and problems that can appear for such applications at scale.

    Read More
    dmtolpeko
  • AWS,  I/O,  S3,  Storage

    S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

    June 4, 2020

    Amazon S3 is highly scalable distributed system that can handle extremely large volumes of data, can adapt to an increasing workload and provide quite good performance as a file storage.

    But sometimes you have to tweak it to run faster that can be especially important for latency-sensitive applications.

    Read More
    dmtolpeko
 Older Posts

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Rartitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts

Archives

  • November 2023 (1)
  • October 2023 (5)
  • September 2023 (1)
  • July 2023 (1)
  • August 2022 (4)
  • April 2022 (1)
  • March 2021 (2)
  • January 2021 (2)
  • June 2020 (4)
  • May 2020 (8)
  • April 2020 (3)
  • February 2020 (3)
  • December 2019 (5)
  • November 2019 (4)
  • October 2019 (1)
  • September 2019 (2)
  • August 2019 (1)
  • May 2019 (9)
  • April 2019 (2)
  • January 2019 (3)
  • December 2018 (4)
  • November 2018 (1)
  • October 2018 (6)
  • September 2018 (2)

Categories

  • Amazon (14)
  • Auto Scaling (1)
  • AWS (28)
  • Cost Optimization (1)
  • CPU (2)
  • Data Skew (2)
  • Distributed (1)
  • EC2 (1)
  • EMR (13)
  • ETL (2)
  • Flink (5)
  • Hadoop (14)
  • Hive (17)
  • Hue (1)
  • I/O (25)
  • JSON (1)
  • JVM (3)
  • Kinesis (1)
  • Logs (1)
  • Memory (7)
  • Monitoring (4)
  • Optimizer (2)
  • ORC (5)
  • Parquet (8)
  • Pig (2)
  • Presto (3)
  • Qubole (2)
  • RDS (1)
  • S3 (18)
  • Snowflake (6)
  • Spark (17)
  • Storage (14)
  • Tez (10)
  • YARN (18)

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org
Savona Theme by Optima Themes