May 2020 – Large-Scale Data Engineering in Cloud

I/O, Parquet, Storage

How Parquet Files are Written – Row Groups, Pages, Required Memory and Flush Operations

May 29, 2020

Parquet is one of the most popular columnar file formats used in many tools including Apache Hive, Spark, Presto, Flink and many others.

To tune Parquet file writes for various workloads and scenarios, let’s see how the Parquet writer works in detail (as of Parquet 1.10, but most concepts apply to later versions as well).

dmtolpeko
AWS, S3

S3 Multipart Upload – 5 MB Part Size Limit

May 27, 2020

It is a well-known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB, with the exception that the last part can be smaller than 5 MB.

Does this mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?
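To make the rule concrete, here is a minimal Python sketch that checks a list of part sizes against the limits stated above. The helper names `split_into_parts` and `is_valid_plan` are hypothetical and not part of any AWS SDK, and the sketch assumes the limits are binary units (5 MiB and 5 GiB). It only models the size rule; it does not call S3.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Limits from the text above, assumed here to be binary units.
MIN_PART_SIZE = 5 * MIB
MAX_PART_SIZE = 5 * GIB


def split_into_parts(file_size, part_size):
    """Split file_size bytes into parts of part_size bytes (hypothetical helper).

    The last part holds whatever remains, so it may be smaller than part_size.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size must be between 5 MiB and 5 GiB")
    full_parts, remainder = divmod(file_size, part_size)
    parts = [part_size] * full_parts
    if remainder:
        parts.append(remainder)
    return parts


def is_valid_plan(parts):
    """Check the size rule: every part except the last must be at least 5 MiB."""
    if not parts:
        return False
    *head, last = parts
    head_ok = all(MIN_PART_SIZE <= p <= MAX_PART_SIZE for p in head)
    return head_ok and 0 < last <= MAX_PART_SIZE


# A 1 MiB file becomes a single part, and that part is also the last one.
small_file_plan = split_into_parts(1 * MIB, 5 * MIB)
print(small_file_plan, is_valid_plan(small_file_plan))
```

Under this model, a single small file yields exactly one part, and since that part is the last part, the minimum size does not apply to it.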

dmtolpeko
AWS, Flink, S3

Flink S3 Checkpoints – Monitoring Using S3 Access Logs

May 26, 2020

You can use the Flink Web UI to monitor the checkpoint operations in Flink, but in some cases S3 access logs can provide more information, and can be especially useful if you run many Flink applications.

dmtolpeko
AWS, Hive, S3

Hive Table for S3 Access Logs

May 26, 2020

Amazon S3 can generate a lot of logs, so it makes sense to have an ETL process that parses and combines them and stores them in Parquet or ORC format for better query performance. Still, there is an easy way to analyze the logs using a Hive table created directly on top of the raw S3 log directory.

dmtolpeko
AWS, Kinesis

Kinesis Client Library (KCL 2.x) Consumer – Load Balancing, Rebalancing – Taking, Renewing and Stealing Leases

May 20, 2020

For zero-downtime, large-scale systems you can have multiple compute clusters located in different availability zones.

The Kinesis KCL 2.x Consumer is very helpful for building highly scalable, elastic and fault-tolerant streaming data processing pipelines for Amazon Kinesis. Let’s review some of the KCL internals related to load balancing and the response to compute node or cluster failures, and how you can tune and monitor these activities.

dmtolpeko
CPU, Hadoop, YARN

YARN – Negative vCores – Capacity Scheduler with Memory Resource Type

May 8, 2020

You might expect that the total number of vCores available to YARN limits the number of containers you can run concurrently, but that’s not true in some cases.

Let’s consider one of them – Capacity Scheduler with DefaultResourceCalculator (Memory only).
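As a rough illustration, here is a simplified Python model of a memory-only scheduling decision. It is not YARN code: the `Node` class, the function names and the node and container sizes are all hypothetical, chosen only to show how the vCore balance can go below zero when memory alone decides how many containers fit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node capacity as seen by the scheduler."""
    memory_mb: int
    vcores: int


def max_containers_memory_only(node, container_memory_mb):
    """Number of containers that fit when only memory is checked."""
    return node.memory_mb // container_memory_mb


def available_vcores(node, running_containers, container_vcores):
    """vCores left after allocation; negative means vCores are oversubscribed."""
    return node.vcores - running_containers * container_vcores


# Assumed sizes: a 64 GB / 16 vCore node and containers of 2 GB / 1 vCore each.
node = Node(memory_mb=65536, vcores=16)
running = max_containers_memory_only(node, container_memory_mb=2048)
print(running, available_vcores(node, running, container_vcores=1))  # 32 -16
```

In this model memory allows 32 containers on a node with only 16 vCores, so the vCore balance ends up at -16.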

dmtolpeko
AWS, CPU, EC2, EMR, Hadoop, Qubole, YARN

AWS EC2 vCPU and YARN vCores – M4, C4, R4 Instances

May 7, 2020

Let’s review how EC2 vCPUs correspond to YARN vCores in Amazon EMR and Qubole Hadoop clusters. As an example, I will choose m4.4xlarge, r4.4xlarge and c4.4xlarge EC2 instance types.

An EC2 vCPU is a thread of a CPU core (typically, there are two threads per core). Does this mean that the number of YARN vCores should be equal to the number of EC2 vCPUs? That’s not always the case.

dmtolpeko
AWS, S3

S3 REST API – HTTP/1.1 Requests for Uploading Files

May 2, 2020

Let’s review major REST API requests for uploading files to S3 (PutObject, CreateMultipartUpload, UploadPart and CompleteMultipartUpload) that you can observe in S3 access logs.

This can be helpful for monitoring S3 write performance. See also S3 Multipart Upload – S3 Access Log Messages.
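As a small sketch of how such monitoring could start, the Python snippet below maps the Operation field of an S3 access log record to the API calls listed above. The operation strings are what I would expect to see in the logs, so treat them as an assumption and verify them against your own access logs; the function name `api_call_for` is hypothetical.

```python
# Assumed mapping from the access log Operation field to the S3 API call.
# Verify these strings against your own S3 access logs before relying on them.
UPLOAD_OPERATIONS = {
    "REST.PUT.OBJECT": "PutObject",
    "REST.POST.UPLOADS": "CreateMultipartUpload",
    "REST.PUT.PART": "UploadPart",
    "REST.POST.UPLOAD": "CompleteMultipartUpload",
}


def api_call_for(operation):
    """Return the upload API call for a log operation, or None if it is not an upload."""
    return UPLOAD_OPERATIONS.get(operation)


def count_upload_calls(operations):
    """Count upload-related API calls in a sequence of Operation field values."""
    counts = {}
    for op in operations:
        call = api_call_for(op)
        if call is not None:
            counts[call] = counts.get(call, 0) + 1
    return counts
```

Feeding `count_upload_calls` the Operation values extracted from a log file gives a quick view of how many uploads were single-request and how many were multipart.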

dmtolpeko