Large-Scale Data Engineering and Analytics in Cloud

Performance Tuning and Optimization / Internals, Research

  • Parquet

    Parquet 1.x File Format – Footer Content

    January 15, 2021

    Every Parquet file has a footer that contains metadata: the schema, row groups and column statistics. The footer is located at the end of the file.

    A Parquet file starts and ends with the 4-byte “magic” string PAR1. Right before the ending PAR1 there is a 4-byte footer length, stored as a little-endian integer.

    The position of the footer can be easily calculated as: File_length - Footer_length - 8, where the 8 bytes are the 4-byte footer length field plus the 4-byte ending PAR1.
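
    The layout described above can be checked with a few lines of Python. This is a minimal sketch, not a Parquet reader: the function name locate_footer and the synthetic sample bytes are assumptions made for illustration, and the fake footer is a placeholder rather than real Thrift-encoded metadata.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (footer_offset, footer_length) for the bytes of a Parquet-like file."""
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    # The 4 bytes right before the trailing magic hold the footer length,
    # read here as a little-endian 32-bit integer.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    # Skip back over the footer itself, the length field and the magic.
    footer_offset = len(data) - footer_length - 8
    if footer_offset < 4:
        raise ValueError("footer length exceeds file size")
    return footer_offset, footer_length


# Synthetic stand-in for a real file (hypothetical payloads):
# magic, fake data pages, fake footer, footer length, magic.
pages = b"x" * 100
footer = b"fake-footer-metadata"
sample = MAGIC + pages + footer + struct.pack("<I", len(footer)) + MAGIC

offset, length = locate_footer(sample)
print(offset, length)
```

    For a real file the same arithmetic applies after seeking to the last 8 bytes, so only the tail of the file has to be read to find the footer.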

    dmtolpeko
  • I/O,  Parquet,  Storage

    How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

    June 18, 2020

    It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

    When JSON data has an arbitrary schema, i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

    How is it stored? What read performance can you expect? Will json_map["key"] read only data for key or the entire JSON?
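
    To make the storage question concrete, here is a simplified model, assuming the standard Parquet MAP layout in which entries form a repeated key_value group, so all keys go to one column and all values to another. The helpers shred_maps and lookup are hypothetical names, and the offsets list is only a stand-in for Parquet's repetition levels; encoding and compression are ignored.

```python
import json
from typing import Dict, List, Optional, Tuple


def shred_maps(rows: List[Dict[str, str]]) -> Tuple[List[int], List[str], List[str]]:
    """Flatten per-row maps into row offsets, one keys column and one values column."""
    offsets, keys, values = [0], [], []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        # Mark where the next row's entries start in the shared columns.
        offsets.append(len(keys))
    return offsets, keys, values


def lookup(offsets: List[int], keys: List[str], values: List[str],
           row_index: int, key: str) -> Optional[str]:
    """Emulate json_map[key] for one row by scanning that row's slice of the keys column."""
    for i in range(offsets[row_index], offsets[row_index + 1]):
        if keys[i] == key:
            return values[i]
    return None


payloads = ['{"a": "1", "b": "2"}', '{"c": "3"}', '{"a": "4", "d": "5"}']
rows = [json.loads(p) for p in payloads]
offsets, keys, values = shred_maps(rows)
print(offsets)
print(keys)
print([lookup(offsets, keys, values, i, "a") for i in range(len(rows))])
```

    In this model there is no dedicated column per key: a lookup for a single key still walks the shared keys column, which is the kind of cost the questions above are probing.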

    dmtolpeko
  • Flink,  I/O,  Parquet,  S3

    Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

    June 9, 2020

    It is quite common to have a streaming Flink application that reads incoming data and writes it to Parquet files with low latency (a couple of minutes), so that analysts can run both near-realtime and historical ad-hoc analysis, mostly using SQL queries.

    Let’s review the write patterns and the problems that can appear for such applications at scale.

    dmtolpeko
  • I/O,  Parquet,  Storage

    How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

    May 29, 2020

    Parquet is one of the most popular columnar file formats, used by many tools including Apache Hive, Spark, Presto and Flink.

    To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).

    dmtolpeko
  • Amazon,  AWS,  Hive,  I/O,  Parquet,  S3,  Spark

    Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms

    December 30, 2019

    I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads it into a daily partition of a Hive table. This is a typical job in a data lake. It is quite simple, but in my case it was very slow.

    Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet, while the actual Spark job took just 38 minutes to run and the remaining time was spent on loading data into a Hive partition.
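
    The time split quoted above can be made explicit with a little arithmetic. This is only a rough sketch: the total is taken as exactly 4 hours although the text says "about", and the helper name load_breakdown is made up for illustration.

```python
from typing import Tuple


def load_breakdown(total_minutes: float, job_minutes: float) -> Tuple[float, float]:
    """Return (minutes spent outside the Spark job, their share of the total run)."""
    load_minutes = total_minutes - job_minutes
    return load_minutes, load_minutes / total_minutes


# Figures from the text: about 4 hours end to end, 38 minutes in the Spark job.
load_minutes, load_share = load_breakdown(4 * 60, 38)
print(load_minutes, round(load_share * 100))
```

    Under that assumption roughly 200 minutes, more than four fifths of the run, went into loading the Hive partition rather than into the conversion itself.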

    Let’s see what causes this behavior and how we can improve the performance.

    dmtolpeko
