Amazon EMR Spark – Ignoring Partition Filter and Listing All Partitions When Reading from S3A

April 20, 2022

I have a partitioned Hive table, created by an open-source version of Hadoop, that uses the s3a:// scheme in the location of every partition. The table has more than 10,000 partitions, and every partition has about 8,000 Parquet files:

$ hive -e "show partitions events";
...
dateint=20220419/hour=11
dateint=20220419/hour=12
dateint=20220419/hour=13

$ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=11

$ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=12

$ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=13

The s3a:// scheme is specified for every partition in this table.

Reading a Partition in Amazon EMR Spark

I tried to read data from a single partition using Spark SQL:

$ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"

The Spark driver failed with:

# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 4847"...
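
A rough back-of-the-envelope estimate, assuming (as the title suggests) that the driver lists every partition instead of only the one selected by the filter, and that each partition directory is listed separately. The partition and file counts come from the table description above; the 1,000-keys-per-response limit is the S3 ListObjectsV2 maximum, and the function name is made up for this sketch:

```python
# Rough estimate of how many S3 objects and LIST calls are involved when
# every partition is listed instead of just the one selected by the filter.

PARTITIONS = 10_000          # "more than 10,000 partitions" (lower bound)
FILES_PER_PARTITION = 8_000  # "about 8,000 Parquet files" per partition
MAX_KEYS_PER_LIST = 1_000    # S3 ListObjectsV2 returns at most 1,000 keys per response


def listing_cost(partitions: int, files_per_partition: int) -> tuple[int, int]:
    """Return (objects, list_requests) when each partition is listed separately."""
    requests_per_partition = -(-files_per_partition // MAX_KEYS_PER_LIST)  # ceiling division
    return partitions * files_per_partition, partitions * requests_per_partition


all_objects, all_requests = listing_cost(PARTITIONS, FILES_PER_PARTITION)
one_objects, one_requests = listing_cost(1, FILES_PER_PARTITION)

print(f"all partitions: {all_objects:,} objects, {all_requests:,} LIST requests")
print(f"one partition:  {one_objects:,} objects, {one_requests:,} LIST requests")
```

Under these assumptions the driver would have to hold metadata for tens of millions of files rather than a few thousand, which is consistent with the GC overhead error shown above.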

AWS, Flink, S3

Flink and S3 Entropy Injection for Checkpoints

January 2, 2021

When you use S3 for storing checkpoints, it can easily become a bottleneck, especially for a Flink application with a lot of subtasks. To overcome this problem, FLINK-9061 introduced entropy injection into the checkpoint path.

But the Flink documentation provides a misleading example (at least up to Flink 1.13) that actually destroys the value of the checkpoint entropy.
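
To make the idea concrete, here is a minimal sketch of what entropy injection does to a path: a marker substring is replaced with random characters so that writes spread across different key prefixes. This is an illustration only, not Flink's implementation. The function name, the bucket and the job name are hypothetical; the default marker and length are chosen to resemble Flink's `s3.entropy.key` and `s3.entropy.length` settings.

```python
import secrets
import string

ALPHANUMERIC = string.ascii_lowercase + string.digits


def inject_entropy(path: str, key: str = "_entropy_", length: int = 4) -> str:
    """Replace the first occurrence of `key` in `path` with random characters."""
    if key not in path:
        return path  # no marker, so the path is returned unchanged
    entropy = "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))
    return path.replace(key, entropy, 1)


# Hypothetical checkpoint path; the bucket and job names are made up.
template = "s3://my-bucket/_entropy_/checkpoints/my-job/"
for _ in range(3):
    print(inject_entropy(template))
```

Each call prints a different path, so where the marker sits in the path decides which part of the key actually varies.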

Read More

dmtolpeko
Flink, I/O, Parquet, S3

Flink Streaming to Parquet Files in S3 – Massive Write IOPS on Checkpoint

June 9, 2020

It is quite common to have a streaming Flink application that reads incoming data and writes it to Parquet files with low latency (a couple of minutes), so that analysts can run both near-real-time and historical ad-hoc analysis, mostly using SQL queries.

But let’s review write patterns and problems that can appear for such applications at scale.

Read More

dmtolpeko
AWS, I/O, S3, Storage

S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

June 4, 2020

Amazon S3 is a highly scalable distributed system that can handle extremely large volumes of data, adapt to an increasing workload and provide quite good performance as file storage.

But sometimes you have to tweak it to run faster, which can be especially important for latency-sensitive applications.
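
As the title hints, one such tweak is to combine a short request timeout with immediate retries. The sketch below shows the general pattern only and is not the author's implementation: `call_with_retries` and `make_flaky_upload` are hypothetical names, and the "upload" is a stand-in that simulates slow requests instead of calling S3.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[float], T], timeout_s: float, max_attempts: int) -> T:
    """Call operation(timeout_s) until it succeeds or max_attempts is reached.

    `operation` is expected to raise TimeoutError when it exceeds the timeout.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation(timeout_s)
        except TimeoutError as error:
            last_error = error  # abandon the slow request and retry right away
    raise TimeoutError(f"no response after {max_attempts} attempts") from last_error


def make_flaky_upload(slow_calls: int) -> Callable[[float], str]:
    """Hypothetical stand-in for an S3 PUT: the first `slow_calls` calls time out."""
    state = {"calls": 0}

    def upload(timeout_s: float) -> str:
        state["calls"] += 1
        if state["calls"] <= slow_calls:
            raise TimeoutError(f"exceeded {timeout_s}s")
        return f"ok after {state['calls']} calls"

    return upload


print(call_with_retries(make_flaky_upload(2), timeout_s=0.5, max_attempts=5))
# prints "ok after 3 calls"
```

The trade-off is extra requests in exchange for a tighter bound on worst-case latency, so the timeout and attempt count need to be chosen against real latency measurements.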

Read More

dmtolpeko
AWS, S3

S3 Multipart Upload – 5 MB Part Size Limit

May 27, 2020

It is a well-known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB, with the exception that the last part can be smaller than 5 MB.

Does this mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?
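
The rule can be written down as a small check. This is a sketch of the limits as stated above, assuming MB and GB mean MiB and GiB and adding the S3 limit of 10,000 parts per upload; `parts_are_valid` is a hypothetical helper, not an S3 API. Taken literally, the rule treats the only part of a single-part upload as the last part.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

MIN_PART_SIZE = 5 * MIB  # applies to every part except the last one
MAX_PART_SIZE = 5 * GIB  # applies to every part
MAX_PARTS = 10_000       # maximum number of parts in one multipart upload


def parts_are_valid(part_sizes: list[int]) -> bool:
    """Check a list of part sizes (in bytes) against the multipart size rules."""
    if not 1 <= len(part_sizes) <= MAX_PARTS:
        return False
    if any(size > MAX_PART_SIZE for size in part_sizes):
        return False
    # Only the parts before the last one must reach the minimum size.
    return all(size >= MIN_PART_SIZE for size in part_sizes[:-1])


print(parts_are_valid([5 * MIB, 1 * MIB]))  # small last part
print(parts_are_valid([1 * MIB, 1 * MIB]))  # small first part
print(parts_are_valid([1 * MIB]))           # a single small part is also the last part
```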

Read More

dmtolpeko
AWS, Flink, S3

Flink S3 Checkpoints – Monitoring Using S3 Access Logs

May 26, 2020

You can use the Flink Web UI to monitor the checkpoint operations in Flink, but in some cases S3 access logs can provide more information, and can be especially useful if you run many Flink applications.

Read More

dmtolpeko
AWS, Hive, S3

Hive Table for S3 Access Logs

May 26, 2020

Although Amazon S3 can generate a lot of logs and it makes sense to have an ETL process to parse, combine and put the logs into Parquet or ORC format for better query performance, there is still an easy way to analyze logs using a Hive table created just on top of the raw S3 log directory.

Read More

dmtolpeko
AWS, S3

S3 REST API – HTTP/1.1 Requests for Uploading Files

May 2, 2020

Let’s review the major REST API requests for uploading files to S3 (PutObject, CreateMultipartUpload, UploadPart and CompleteMultipartUpload) that you can observe in S3 access logs.

This can be helpful for monitoring S3 write performance. See also S3 Multipart Upload – S3 Access Log Messages.

Read More

dmtolpeko
AWS, I/O, S3

S3 Multipart Upload – S3 Access Log Messages

April 17, 2020

Most applications writing data into S3 use the S3 multipart upload API to upload data in parts. First, you initiate the upload, then upload the parts and finally complete the multipart upload.
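
The three steps can be sketched without touching S3 by planning how a file would be split into parts. `plan_parts` is a hypothetical helper, and the sizes in the example are arbitrary; the comments note which S3 request each step corresponds to.

```python
def plan_parts(total_size: int, part_size: int) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) for each part; part numbers start at 1."""
    if total_size <= 0 or part_size <= 0:
        raise ValueError("total_size and part_size must be positive")
    parts = []
    offset = 0
    part_number = 1
    while offset < total_size:
        length = min(part_size, total_size - offset)  # the last part may be shorter
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


# Step 1: initiate the upload (CreateMultipartUpload returns an upload ID).
# Step 2: upload each planned part (one UploadPart request per tuple below).
# Step 3: complete the upload (CompleteMultipartUpload lists the part numbers and ETags).
MIB = 1024 * 1024
for part_number, offset, length in plan_parts(total_size=12 * MIB, part_size=5 * MIB):
    print(part_number, offset, length)
```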

Let’s see how this operation is reflected in the S3 access log. My application uploaded the file data.gz into S3, and I can view it as follows:

Read More

dmtolpeko
AWS, Flink, I/O, S3

Flink – Tuning Writes to S3 Sink – fs.s3a.threads.max

April 12, 2020

One of our Flink streaming jobs had significant variance in the time spent on writing files to S3 by the same Task Manager process.

What settings do you need to check first?

Read More

dmtolpeko