Large-Scale Data Engineering in Cloud

Performance Tuning, Cost Optimization / Internals, Research by Dmitry Tolpeko

  • AWS,  I/O,  S3,  Storage

    S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

    June 4, 2020

    Amazon S3 is a highly scalable distributed system that can handle extremely large volumes of data, adapt to an increasing workload, and provide quite good performance as file storage.

    But sometimes you have to tweak it to run faster, which can be especially important for latency-sensitive applications.

    dmtolpeko
  • I/O,  Parquet,  Storage

    How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

    May 29, 2020

    Parquet is one of the most popular columnar file formats used in many tools including Apache Hive, Spark, Presto, Flink and many others.

    To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).

    dmtolpeko
  • AWS,  S3

    S3 Multipart Upload – 5 MB Part Size Limit

    May 27, 2020

    It is a well-known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB, with the exception that the last part can be less than 5 MB.

    Does this mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?
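
    The part size rule above can be sketched as a small validator. This is only an illustration, not S3 client code: the function name `validate_part_sizes` is hypothetical, the sizes are assumed to be binary units (MiB/GiB), and the 10,000-part cap is the documented S3 limit rather than something stated in the excerpt.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

MIN_PART_SIZE = 5 * MIB   # minimum for every part except the last one
MAX_PART_SIZE = 5 * GIB   # maximum for any part
MAX_PARTS = 10_000        # documented cap on parts per multipart upload


def validate_part_sizes(part_sizes):
    """Return a list of problems; an empty list means the plan looks valid.

    `part_sizes` is a list of part sizes in bytes, in upload order.
    (Hypothetical helper for illustration only.)
    """
    problems = []
    if len(part_sizes) > MAX_PARTS:
        problems.append(f"too many parts: {len(part_sizes)} > {MAX_PARTS}")
    last = len(part_sizes) - 1
    for i, size in enumerate(part_sizes):
        if size > MAX_PART_SIZE:
            problems.append(f"part {i + 1} exceeds 5 GiB")
        elif i != last and size < MIN_PART_SIZE:
            problems.append(f"part {i + 1} is below 5 MiB and is not the last part")
    return problems


# A single 1 MiB part is also the last part, so the minimum does not apply.
single_small_part = validate_part_sizes([1 * MIB])

# A small part followed by another part violates the minimum.
small_part_first = validate_part_sizes([1 * MIB, 8 * MIB])
```

    Under the rule as stated, `single_small_part` is empty, because the only part of the upload is also its last part, while `small_part_first` reports one problem.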

    dmtolpeko
  • AWS,  Flink,  S3

    Flink S3 Checkpoints – Monitoring Using S3 Access Logs

    May 26, 2020

    You can use the Flink Web UI to monitor the checkpoint operations in Flink, but in some cases S3 access logs can provide more information, and can be especially useful if you run many Flink applications.

    dmtolpeko
  • AWS,  Hive,  S3

    Hive Table for S3 Access Logs

    May 26, 2020

    Amazon S3 can generate a lot of logs, and it makes sense to have an ETL process that parses, combines and stores them in Parquet or ORC format for better query performance. Still, there is an easy way to analyze the logs: a Hive table created directly on top of the raw S3 log directory.

    dmtolpeko
  • AWS,  Kinesis

    Kinesis Client Library (KCL 2.x) Consumer – Load Balancing, Rebalancing – Taking, Renewing and Stealing Leases

    May 20, 2020

    For zero-downtime, large-scale systems you can have multiple compute clusters located in different availability zones.

    The Kinesis KCL 2.x Consumer is very helpful for building highly scalable, elastic and fault-tolerant streaming data processing pipelines for Amazon Kinesis. Let’s review some of the KCL internals related to load balancing and the response to compute node/cluster failures, and how you can tune and monitor such activities.

    dmtolpeko
  • CPU,  Hadoop,  YARN

    YARN – Negative vCores – Capacity Scheduler with Memory Resource Type

    May 8, 2020

    You might expect that the total number of vCores available to YARN limits the number of containers you can run concurrently, but that is not true in some cases.

    Let’s consider one of them – Capacity Scheduler with DefaultResourceCalculator (Memory only).
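
    A rough sketch of why the vCore count can go negative in this setup: with a memory-only resource calculator, the scheduler packs containers by memory alone, while the vCores requested by each container are still subtracted from the node total. The function name and all numbers below are hypothetical, and the model ignores everything else the scheduler does (queues, reservations, minimum allocation rounding).

```python
def schedule_memory_only(node_memory_mb, node_vcores,
                         container_memory_mb, container_vcores):
    """Simulate memory-only container packing on a single node.

    Returns (containers_started, available_vcores). Only memory limits the
    number of containers; vCores are merely accounted for afterwards.
    (Hypothetical simplified model, not YARN code.)
    """
    # Memory is the only resource that is checked when placing containers.
    containers = node_memory_mb // container_memory_mb
    # vCores are still counted per container, so the balance can drop below 0.
    used_vcores = containers * container_vcores
    available_vcores = node_vcores - used_vcores
    return containers, available_vcores


# Hypothetical node: 64 GB of memory and 16 vCores given to YARN,
# containers of 2 GB and 1 vCore each.
containers, available_vcores = schedule_memory_only(65536, 16, 2048, 1)
print(containers, available_vcores)  # prints: 32 -16
```

    In this made-up example the node runs 32 containers although only 16 vCores are configured, so the reported number of available vCores is -16.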

    dmtolpeko
  • AWS,  CPU,  EC2,  EMR,  Hadoop,  Qubole,  YARN

    AWS EC2 vCPU and YARN vCores – M4, C4, R4 Instances

    May 7, 2020

    Let’s review how EC2 vCPUs correspond to YARN vCores in Amazon EMR and Qubole Hadoop clusters. As an example, I will choose m4.4xlarge, r4.4xlarge and c4.4xlarge EC2 instance types.

    An EC2 vCPU is a thread of a CPU core (typically, there are two threads per core). Does this mean that the number of YARN vCores should equal the number of EC2 vCPUs? That’s not always the case.
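
    The vCPU arithmetic can be written down as a short sketch. The vCPU counts are the published values for the three instance types named above; the helper name `physical_cores` is hypothetical, and the two-threads-per-core default is the typical case mentioned above, not a guarantee for every instance family.

```python
# Published vCPU counts for the instance types used as examples.
INSTANCE_VCPUS = {
    "m4.4xlarge": 16,
    "r4.4xlarge": 16,
    "c4.4xlarge": 16,
}


def physical_cores(instance_type, threads_per_core=2):
    """Estimate physical cores from vCPUs (hypothetical helper).

    Assumes every core exposes `threads_per_core` hardware threads.
    """
    return INSTANCE_VCPUS[instance_type] // threads_per_core


for instance_type in INSTANCE_VCPUS:
    print(instance_type, INSTANCE_VCPUS[instance_type], physical_cores(instance_type))
```

    The number of vCores a NodeManager offers is a configuration value (`yarn.nodemanager.resource.cpu-vcores`), so a cluster can set it to the vCPU count, to the physical core count, or to something else entirely.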

    dmtolpeko
  • AWS,  S3

    S3 REST API – HTTP/1.1 Requests for Uploading Files

    May 2, 2020

    Let’s review major REST API requests for uploading files to S3 (PutObject, CreateMultipartUpload, UploadPart and CompleteMultipartUpload) that you can observe in S3 access logs.

    This can be helpful for monitoring S3 write performance. See also S3 Multipart Upload – S3 Access Log Messages.
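
    As a small sketch of how these requests could be tallied from access logs: the mapping below uses the operation names as I understand them to appear in the Operation field of S3 server access logs, so verify them against your own logs. The function name is hypothetical, and the input is assumed to be a list of Operation values that has already been extracted from the log lines.

```python
from collections import Counter

# Assumed mapping from the access log Operation field to the API request.
UPLOAD_OPERATIONS = {
    "REST.PUT.OBJECT": "PutObject",
    "REST.POST.UPLOADS": "CreateMultipartUpload",
    "REST.PUT.PART": "UploadPart",
    "REST.POST.UPLOAD": "CompleteMultipartUpload",
}


def count_upload_requests(operations):
    """Count upload-related API requests in a list of Operation values.

    Operations that are not upload requests (for example reads) are ignored.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter()
    for operation in operations:
        api_request = UPLOAD_OPERATIONS.get(operation)
        if api_request is not None:
            counts[api_request] += 1
    return dict(counts)


# Made-up sample: one multipart upload with two parts, plus one simple put.
sample = [
    "REST.POST.UPLOADS",
    "REST.PUT.PART",
    "REST.PUT.PART",
    "REST.POST.UPLOAD",
    "REST.PUT.OBJECT",
    "REST.GET.OBJECT",
]
upload_counts = count_upload_requests(sample)
```

    For the made-up sample, `upload_counts` contains two `UploadPart` requests and one of each other upload request, and the read request is skipped.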

    dmtolpeko
  • Flink,  JVM,  Memory,  YARN

    Flink 1.9 – Off-Heap Memory on YARN – Troubleshooting Container is Running Beyond Physical Memory Limits Errors

    April 29, 2020

    On one of my clusters I got my favorite YARN error, although now it was in a Flink application:

    Container is running beyond physical memory limits. Current usage: 99.5 GB of 99.5 GB physical memory used; 105.1 GB of 227.8 GB virtual memory used. Killing container.

    Why did the container take so much physical memory and fail? Let’s investigate in detail.
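
    When such errors show up in many logs, it can help to pull the numbers out of the message. The sketch below parses the exact message quoted above; the function name is hypothetical, and the pattern assumes all four values are reported in GB, as they are here.

```python
import re

# Matches the usage part of the message; assumes every value is in GB.
USAGE_PATTERN = re.compile(
    r"Current usage: (?P<pmem_used>[\d.]+) GB of (?P<pmem_limit>[\d.]+) GB "
    r"physical memory used; (?P<vmem_used>[\d.]+) GB of "
    r"(?P<vmem_limit>[\d.]+) GB virtual memory used"
)


def parse_memory_usage(message):
    """Return the four memory values in GB as floats, or None if no match.

    (Hypothetical helper for illustration only.)
    """
    match = USAGE_PATTERN.search(message)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


message = (
    "Container is running beyond physical memory limits. "
    "Current usage: 99.5 GB of 99.5 GB physical memory used; "
    "105.1 GB of 227.8 GB virtual memory used. Killing container."
)
usage = parse_memory_usage(message)
print(usage["pmem_used"], usage["pmem_limit"])  # prints: 99.5 99.5
```

    For the quoted message, physical memory usage equals the limit, which is why the container was killed, while virtual memory usage is still well below its limit.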

    dmtolpeko

Recent Posts

  • Nov 26, 2023 ORDER BY in Spark – How Global Sort Is Implemented, Sampling, Range Partitioning and Skew
  • Oct 25, 2023 Reading JSON in Spark – Full Read for Inferring Schema and Sampling, SamplingRatio Option Implementation and Issues
  • Oct 15, 2023 Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations
  • Oct 10, 2023 Spark – Reading Parquet – Pushed Filters, SUBSTR(timestamp, 1, 10), LIKE and StringStartsWith
  • Oct 06, 2023 Spark Stage Restarts – Partial Restarts, Multiple Retry Attempts with Different Task Sets, Accepted Late Results from Failed Stages, Cost of Restarts
