Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • About
  • I/O,  Parquet,  Spark

    Spark – Number of Tasks Reading Large Number of Small Parquet Files

    July 19, 2023

    Sometimes source data arrives from a streaming application as a large set of small Parquet files that you need to compact for more efficient reads by analytic applications.

    You may notice that, by default, the number of tasks used to read such Parquet files is larger than expected. Let’s see why; a short configuration sketch follows this entry.

    Read More
    dmtolpeko
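
    To make the discussion concrete, here is a minimal sketch (the paths, target partition count and threshold values are hypothetical, not taken from the post): Spark packs small Parquet files into read tasks based on spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, and the compaction itself can be a simple repartition-and-write.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compact-small-parquet-files").getOrCreate()

    // Two settings that drive how many read tasks Spark creates for a directory of
    // small files: each task receives at most maxPartitionBytes of input, and every
    // file adds openCostInBytes of "padding" when files are packed into tasks.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  // default: 128 MB
    spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      // default: 4 MB

    // Hypothetical input produced by a streaming application
    val df = spark.read.parquet("s3://bucket/stream-output/")
    println(s"Read tasks: ${df.rdd.getNumPartitions}")

    // Compact into a smaller number of larger files (target count is hypothetical)
    df.repartition(32).write.mode("overwrite").parquet("s3://bucket/compacted/")
    ```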
  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Why the Number of Tasks can be Much Larger than the Number of Row Groups

    March 19, 2021

    A row group is the unit of work for reading from Parquet that cannot be split into smaller parts, so you would expect the number of tasks created by Spark to be no more than the total number of row groups in your Parquet data source.

    But Spark can still create many more tasks than there are row groups. Let’s see how this is possible; a quick sketch comparing the two numbers follows this entry.

    Read More
    dmtolpeko
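
    A small sketch of how to check this yourself (the file and directory paths are hypothetical): count the row groups in a file with the parquet-mr API and compare with the number of partitions Spark creates for the same data. Spark splits Parquet files by byte ranges, not by row groups, so the two numbers can differ a lot.

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.util.HadoopInputFile
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Count row groups in a single Parquet file (hypothetical path)
    val reader = ParquetFileReader.open(
      HadoopInputFile.fromPath(new Path("s3://bucket/data/part-00000.parquet"), new Configuration()))
    val rowGroups = reader.getFooter.getBlocks.size()
    reader.close()

    // Spark splits the same data into byte-range tasks (driven by maxPartitionBytes),
    // so the task count can be much higher than the row group count; a task whose
    // byte range contains no row-group midpoint simply reads no rows.
    val df = spark.read.parquet("s3://bucket/data/")
    println(s"Row groups in the file: $rowGroups, Spark read tasks: ${df.rdd.getNumPartitions}")
    ```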
  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Predicate Pushdown for LIKE Operator – EqualTo, StartsWith and Contains Pushed Filters

    March 7, 2021

    A Parquet file contains MIN/MAX statistics for every column in every row group, which allows Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let’s see how this works with the LIKE pattern-matching filter; a sample query-plan sketch follows this entry.

    For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for the product column:

    Read More
    dmtolpeko
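
    As a quick illustration (the data path and the pattern value are hypothetical; only the product column name comes from the post), a LIKE filter with a constant prefix shows up as a StringStartsWith pushed filter in the physical plan, and that is what gets checked against the row-group MIN/MAX statistics:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Hypothetical location of the Parquet data with a `product` column
    val df = spark.read.parquet("s3://bucket/products/")

    // LIKE 'Wash%' has a constant prefix, so it can be pushed down to the Parquet reader
    val q = df.filter("product LIKE 'Wash%'")
    q.explain()
    // The physical plan should contain a line similar to:
    //   PushedFilters: [IsNotNull(product), StringStartsWith(product,Wash)]
    // LIKE '%Wash%' would instead appear as StringContains, and = 'Wash' as EqualTo.
    ```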
  • Parquet

    Parquet 1.x File Format – Footer Content

    January 15, 2021

    Every Parquet file has a footer that contains metadata: the schema, row groups and column statistics. The footer is located at the end of the file.

    A Parquet file starts and ends with the 4-byte PAR1 “magic” string. Right before the ending PAR1 there is a 4-byte footer length (little-endian encoding):

    The start of the footer can then be calculated as: File_length - Footer_length - 8 (4 bytes for the length field plus 4 bytes for the trailing PAR1 magic). A short footer-reading sketch follows this entry.

    Read More
    dmtolpeko
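
    Here is a minimal sketch of applying this layout (the file name is hypothetical): read the 4-byte little-endian length stored just before the trailing PAR1 and compute where the footer starts.

    ```scala
    import java.io.RandomAccessFile

    // Hypothetical local Parquet file
    val file = new RandomAccessFile("data.parquet", "r")
    val fileLength = file.length()

    // The file ends with: [footer][4-byte footer length, little-endian][PAR1]
    file.seek(fileLength - 8)
    val b = new Array[Byte](4)
    file.readFully(b)
    val footerLength =
      (b(0) & 0xFF) | ((b(1) & 0xFF) << 8) | ((b(2) & 0xFF) << 16) | ((b(3) & 0xFF) << 24)

    // Footer start = File_length - Footer_length - 8 (length field + trailing magic)
    val footerStart = fileLength - footerLength - 8
    println(s"Footer length: $footerLength bytes, footer starts at offset $footerStart")
    file.close()
    ```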
  • I/O,  Parquet,  Storage

    How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

    June 18, 2020

    It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

    When JSON data has an arbitrary schema, i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

    How is it stored? What read performance can you expect? Will json_map["key"] read only the data for key, or the entire JSON? A small conversion sketch follows this entry.

    Read More
    dmtolpeko
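
    A small sketch of the conversion itself (the column names, keys and output path are hypothetical): parse the raw JSON string into a map<string,string> column, write it to Parquet, and then read a single key back.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{MapType, StringType}

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Hypothetical JSON payloads where different records carry different keys
    val raw = Seq(
      """{"device":"sensor-1","temp":"21.5"}""",
      """{"device":"sensor-2","humidity":"40"}"""
    ).toDF("json")

    // Parse arbitrary key-value pairs into a map column and store it in Parquet
    val parsed = raw.select(from_json($"json", MapType(StringType, StringType)).alias("json_map"))
    parsed.write.mode("overwrite").parquet("/tmp/json_map_parquet")

    // Accessing a single key of the map column -- how much data this actually
    // reads from Parquet is exactly what the post investigates
    spark.read.parquet("/tmp/json_map_parquet")
      .select($"json_map".getItem("device"))
      .show()
    ```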
  • Flink,  I/O,  Parquet,  S3

    Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

    June 9, 2020

    It is quite common to have a streaming Flink application that reads incoming data and writes it into Parquet files with low latency (a couple of minutes), so that analysts can run both near-real-time and historical ad-hoc analysis, mostly using SQL queries.

    But let’s review the write patterns and problems that can appear for such applications at scale; a sink configuration sketch follows this entry.

    Read More
    dmtolpeko
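
    For context, a rough sketch of such a sink as it looked around Flink 1.10 (the event class, bucket path and the use of Avro reflection are my assumptions, not taken from the post). The relevant property is that bulk-encoded formats like Parquet can only roll part files on a checkpoint, so every checkpoint closes and uploads the in-progress files for all open buckets at once.

    ```scala
    import org.apache.flink.core.fs.Path
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

    // Hypothetical event type written to Parquet via Avro reflection
    case class Event(id: String, eventTime: Long, payload: String)

    // Bulk formats roll a new part file on every checkpoint: with many buckets
    // (e.g. one per event type and date), each checkpoint turns into a burst of
    // S3 write requests -- the IOPS pattern the post analyzes.
    val sink: StreamingFileSink[Event] = StreamingFileSink
      .forBulkFormat(
        new Path("s3://bucket/events/"),  // hypothetical output location
        ParquetAvroWriters.forReflectRecord(classOf[Event]))
      .build()

    // eventStream.addSink(sink)  // attach to a DataStream[Event]
    ```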
  • I/O,  Parquet,  Storage

    How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

    May 29, 2020

    Parquet is one of the most popular columnar file formats, used by many tools including Apache Hive, Spark, Presto and Flink.

    To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well); a writer-settings sketch follows this entry.

    Read More
    dmtolpeko
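
    A small configuration sketch to go with this (the sizes and output path are just examples, not recommendations from the post): the row group and page sizes the writer works with are plain Hadoop configuration properties read by the Parquet output format.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // The writer buffers a whole row group in memory and flushes it to the output
    // stream once its estimated size reaches parquet.block.size; pages are the
    // smaller units written inside each column chunk.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.setInt("parquet.block.size", 128 * 1024 * 1024)  // row group size (example)
    hadoopConf.setInt("parquet.page.size", 1 * 1024 * 1024)     // page size (example)

    // Hypothetical write; memory needed per task grows with the row group size and
    // with the number of Parquet writers kept open at once (e.g. with partitionBy)
    spark.range(0, 1000000).toDF("id").write.mode("overwrite").parquet("/tmp/parquet_write_demo")
    ```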
  • Amazon,  AWS,  Hive,  I/O,  Parquet,  S3,  Spark

    Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms

    December 30, 2019

    I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads it into a daily partition of a Hive table. This is a typical data lake job and it is quite simple, but in my case it was very slow.

    Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet: the actual Spark job took just 38 minutes to run, and the remaining time was spent loading the data into the Hive partition.

    Let’s see what causes this behavior and how we can improve the performance; a committer configuration sketch follows this entry.

    Read More
    dmtolpeko
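
    Without giving away the post’s conclusion, here is a generic illustration of one of the knobs it mentions, the file output committer algorithm version (these are standard Hadoop/Spark settings, not necessarily the fix applied in the post): version 1 moves every task’s output to the final location sequentially during job commit, which on S3 turns into slow copy operations, while version 2 commits task output directly to the destination.

    ```scala
    // On the command line (illustration only):
    //   spark-submit \
    //     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    //     ...

    // Or from inside the application, before the write:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    ```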

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Partitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts
