Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • AWS,  Flink,  S3

    Flink and S3 Entropy Injection for Checkpoints

    January 2, 2021

    When you use S3 for storing checkpoints, it can easily become a bottleneck, especially for a Flink application with a lot of subtasks. To overcome this problem, FLINK-9061 introduced entropy injection into the checkpoint path.

    But the Flink documentation provides a misleading example (at least up to Flink 1.13) that actually destroys the value of the checkpoint entropy.

    dmtolpeko
  • Hadoop,  YARN

    Hadoop YARN – Monitoring Resource Consumption by Running Applications in Multi-Cluster Environments

    June 25, 2020

    In the cloud, it is typical to run multiple compute clusters, so browsing the Web UI of every cluster to check the current resource consumption by applications is not always convenient, especially if the YARN clusters are managed by different Hadoop distributions (Amazon EMR, Cloudera, Qubole etc.).

    Let’s see how you can automate this process and find out how many applications are running and which resources they are consuming (containers, memory and CPU).
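
    As a rough illustration of the idea, the sketch below totals the running applications, containers, memory and CPU for one cluster from the YARN ResourceManager REST API (`/ws/v1/cluster/apps?states=RUNNING`). The cluster name, the ResourceManager URL passed by the caller, and the sample payload are hypothetical, and this is not necessarily the approach the full post takes.

```python
import json
import urllib.request


def fetch_running_apps(rm_url):
    """Fetch running applications from a YARN ResourceManager.

    rm_url is a caller-supplied base URL (hypothetical example:
    "http://rm-host:8088").
    """
    request = urllib.request.Request(
        rm_url + "/ws/v1/cluster/apps?states=RUNNING",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_running_apps(cluster_name, apps_response):
    """Aggregate resource usage from a parsed Cluster Applications response."""
    # "apps" is null in the response when nothing is running.
    apps = (apps_response.get("apps") or {}).get("app") or []
    return {
        "cluster": cluster_name,
        "applications": len(apps),
        "containers": sum(app.get("runningContainers", 0) for app in apps),
        "memory_mb": sum(app.get("allocatedMB", 0) for app in apps),
        "vcores": sum(app.get("allocatedVCores", 0) for app in apps),
    }


# Hypothetical sample payload, trimmed to the fields used above.
SAMPLE_RESPONSE = json.loads("""
{"apps": {"app": [
  {"id": "application_1_0001", "runningContainers": 10,
   "allocatedMB": 40960, "allocatedVCores": 20},
  {"id": "application_1_0002", "runningContainers": 3,
   "allocatedMB": 6144, "allocatedVCores": 3}
]}}
""")

if __name__ == "__main__":
    print(summarize_running_apps("cluster-a", SAMPLE_RESPONSE))
```

    Calling `summarize_running_apps` once per cluster and collecting the dictionaries gives a single table across clusters, whichever distribution manages each ResourceManager.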

  • I/O,  Parquet,  Storage

    How Map Column is Written to Parquet – Converting JSON to Map to Increase Read Performance

    June 18, 2020

    It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries.

    When JSON data has an arbitrary schema, i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

    How is it stored? What read performance can you expect? Will json_map["key"] read only data for key or the entire JSON?

  • Flink,  I/O,  Parquet,  S3

    Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

    June 9, 2020

    It is quite common to have a streaming Flink application that reads incoming data and writes it to Parquet files with low latency (a couple of minutes), so that analysts can run both near-real-time and historical ad-hoc analysis, mostly using SQL queries.

    But let’s review write patterns and problems that can appear for such applications at scale.

  • AWS,  I/O,  S3,  Storage

    S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

    June 4, 2020

    Amazon S3 is a highly scalable distributed system that can handle extremely large volumes of data, adapts to an increasing workload and provides quite good performance as file storage.

    But sometimes you have to tune it to run faster, which can be especially important for latency-sensitive applications.

  • I/O,  Parquet,  Storage

    How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

    May 29, 2020

    Parquet is one of the most popular columnar file formats used in many tools including Apache Hive, Spark, Presto, Flink and many others.

    To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).

  • AWS,  S3

    S3 Multipart Upload – 5 MB Part Size Limit

    May 27, 2020

    It is a well-known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB, with the exception that the last part can be less than 5 MB.

    Does it mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?
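
    The part size rule quoted above can be written down as a small check. This is a minimal sketch, not AWS code: the function name is hypothetical, and it assumes "MB" and "GB" mean MiB and GiB. Under the rule as stated, a single part is also the last part, so the sketch treats a one-part upload of a small file as valid.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

MIN_PART_SIZE = 5 * MIB  # lower bound for every part except the last (assumed MiB)
MAX_PART_SIZE = 5 * GIB  # upper bound for every part (assumed GiB)


def validate_part_sizes(part_sizes):
    """Return a list of rule violations; an empty list means the layout is valid."""
    if not part_sizes:
        return ["a multipart upload needs at least one part"]

    problems = []
    last_index = len(part_sizes) - 1
    for index, size in enumerate(part_sizes):
        if size > MAX_PART_SIZE:
            problems.append(f"part {index + 1} is larger than 5 GiB")
        elif size < MIN_PART_SIZE and index != last_index:
            # Only the last part is exempt from the 5 MiB minimum.
            problems.append(f"part {index + 1} is smaller than 5 MiB and is not the last part")
    return problems


if __name__ == "__main__":
    print(validate_part_sizes([1 * MIB]))           # single small part
    print(validate_part_sizes([1 * MIB, 6 * MIB]))  # small part that is not last
```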

  • AWS,  Flink,  S3

    Flink S3 Checkpoints – Monitoring Using S3 Access Logs

    May 26, 2020

    You can use the Flink Web UI to monitor the checkpoint operations in Flink, but in some cases S3 access logs can provide more information, and can be especially useful if you run many Flink applications.

  • AWS,  Hive,  S3

    Hive Table for S3 Access Logs

    May 26, 2020

    Amazon S3 can generate a lot of logs, and it makes sense to have an ETL process that parses, combines and stores them in Parquet or ORC format for better query performance. Still, there is an easy way to analyze the logs using a Hive table created directly on top of the raw S3 log directory.

  • AWS,  Kinesis

    Kinesis Client Library (KCL 2.x) Consumer – Load Balancing, Rebalancing – Taking, Renewing and Stealing Leases

    May 20, 2020

    For zero-downtime, large-scale systems you can have multiple compute clusters located in different availability zones.

    The Kinesis KCL 2.x Consumer is very helpful for building highly scalable, elastic and fault-tolerant streaming data processing pipelines for Amazon Kinesis. Let’s review some of the KCL internals related to load balancing and the response to compute node/cluster failures, and how you can tune and monitor such activities.


Recent Posts

  • Sep 17, 2023 Spark – LIMIT on Large Datasets – CollectLimit, GlobalLimit, LocalLimit, spark.sql.limit.scaleUpFactor
  • Jul 19, 2023 Spark – Number of Tasks Reading Large Number of Small Parquet Files
  • Aug 30, 2022 Spark 2.4 – Slow Performance on Writing into Partitions – Why Sorting Involved
  • Aug 30, 2022 Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile
  • Aug 29, 2022 EMR Spark – Initial Number of Executors and spark.dynamicAllocation.enabled
