Large-Scale Data Engineering and Analytics in Cloud

Performance Tuning and Cost Optimization / Internals, Research, Consulting

  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Why the Number of Tasks can be Much Larger than the Number of Row Groups

    March 19, 2021

    A row group is the smallest unit of work for reading from Parquet: it cannot be split into smaller parts. You would therefore expect the number of tasks created by Spark to be no more than the total number of row groups in your Parquet data source.

    But Spark can still create many more tasks than there are row groups. Let’s see how this is possible.
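
    As a rough illustration of how this can happen, here is a minimal sketch in plain Python (no Spark required). It assumes a simplified model in which a file is cut into fixed-size byte ranges regardless of row group boundaries, and each row group is read by the one split that contains its midpoint. The file layout, the 128 MB split size and the helper names `plan_splits` and `assign_row_groups` are hypothetical, and real Spark planning takes more settings into account.

```python
MB = 1024 * 1024


def plan_splits(file_size, max_split_bytes):
    """Cut a file into byte ranges [start, end) of at most max_split_bytes."""
    return [
        (start, min(start + max_split_bytes, file_size))
        for start in range(0, file_size, max_split_bytes)
    ]


def assign_row_groups(splits, row_groups):
    """Map each split to the indexes of row groups whose midpoint it contains."""
    assignment = {split: [] for split in splits}
    for index, (offset, size) in enumerate(row_groups):
        midpoint = offset + size // 2
        for start, end in splits:
            if start <= midpoint < end:
                assignment[(start, end)].append(index)
                break
    return assignment


# Hypothetical file: 4 row groups of 256 MB each, given as (offset, size).
row_groups = [(i * 256 * MB, 256 * MB) for i in range(4)]
file_size = 4 * 256 * MB

# Assumed split size of 128 MB, i.e. half of a row group.
splits = plan_splits(file_size, 128 * MB)
assignment = assign_row_groups(splits, row_groups)
tasks_with_data = sum(1 for groups in assignment.values() if groups)

print(len(splits), tasks_with_data)  # 8 4
```

    In this model there are 8 splits (one task each) for only 4 row groups, and half of the tasks find no row group to read.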

    dmtolpeko
  • I/O,  Parquet,  Spark

    Spark – Reading Parquet – Predicate Pushdown for LIKE Operator – EqualTo, StartsWith and Contains Pushed Filters

    March 7, 2021

    A Parquet file contains MIN/MAX statistics for every column in every row group, which allow Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let’s see how this works with the LIKE pattern-matching filter.

    For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for the product column:
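
    To make the idea concrete, here is a minimal sketch in plain Python of how MIN/MAX statistics can rule out a row group for a prefix pattern such as LIKE 'ch%'. The statistics below are made up for illustration and are not the ones from my test file, the function name `can_skip_row_group` is hypothetical, and Python string comparison stands in for the byte-wise comparison a real reader would use (the two agree for ASCII values).

```python
def can_skip_row_group(min_value, max_value, prefix):
    """Return True if no value in [min_value, max_value] can start with prefix.

    Both bounds are truncated to the prefix length: if the truncated max sorts
    before the prefix, or the truncated min sorts after it, the row group
    cannot contain a matching value.
    """
    n = len(prefix)
    return max_value[:n] < prefix or min_value[:n] > prefix


# Hypothetical (min, max) statistics of a string column for 4 row groups.
row_groups = [
    ("apple", "banana"),
    ("cherry", "grape"),
    ("kiwi", "mango"),
    ("melon", "plum"),
]

# Row groups that still have to be read for the pattern LIKE 'ch%'.
to_read = [
    index
    for index, (min_value, max_value) in enumerate(row_groups)
    if not can_skip_row_group(min_value, max_value, "ch")
]

print(to_read)  # [1]
```

    A pattern with a leading wildcard such as LIKE '%ch%' gives no such bound, so MIN/MAX statistics alone cannot exclude any row group for it.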

    dmtolpeko
  • Amazon,  AWS,  Hive,  I/O,  Parquet,  S3,  Spark

    Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms

    December 30, 2019

    I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads it into a daily partition of a Hive table. This is a typical job in a data lake and it is quite simple, but in my case it was very slow.

    Initially the whole process took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet. The Spark job itself ran for just 38 minutes; the remaining time was spent loading the data into a Hive partition.

    Let’s see what causes this behavior and how we can improve the performance.

    dmtolpeko
  • Amazon,  EMR,  Spark

    Extremely Large Number of RDD Partitions and Tasks in Spark on Amazon EMR

    April 8, 2019

    After creating an Amazon EMR cluster with Spark support and running a Spark application, you may notice that the Spark job creates too many tasks to process even a very small data set.

    For example, I have a small table country_iso_codes that has 249 rows and is stored in a comma-delimited text file 10,657 bytes long.

    When running the following application on an Amazon EMR 5.7 cluster with Spark 2.1.1 and the default settings, I can see a large number of partitions generated:

    dmtolpeko
