Amazon, AWS, EMR, Spark

EMR Spark – Initial Number of Executors and spark.dynamicAllocation.enabled

August 29, 2022

By default, EMR Spark clusters have spark.dynamicAllocation.enabled set to true, meaning that the cluster dynamically scales the number of executors up and down as required.

But what is the initial number of executors when you start your Spark job?
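
As a rough point of reference, open-source Spark picks the initial executor count as the largest of spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.initialExecutors and spark.executor.instances. The sketch below only models that rule; the function name and the sample values are hypothetical, and EMR may apply its own defaults on top of it.

```python
def initial_executors(min_executors=0, initial=None, executor_instances=None):
    """Model of how open-source Spark chooses the initial executor count
    when dynamic allocation is enabled (an approximation, not EMR-specific)."""
    # spark.dynamicAllocation.initialExecutors falls back to minExecutors
    if initial is None:
        initial = min_executors
    candidates = [min_executors, initial]
    # spark.executor.instances (--num-executors) is considered only if set
    if executor_instances is not None:
        candidates.append(executor_instances)
    return max(candidates)


# Hypothetical settings, for illustration only
print(initial_executors())                                       # nothing set
print(initial_executors(min_executors=2, initial=5))             # initialExecutors wins
print(initial_executors(min_executors=2, executor_instances=10)) # executor.instances wins
```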

Read More

dmtolpeko
Amazon, AWS, EMR, Spark

EMR Spark – Much Larger Executors are Created than Requested

August 26, 2022
Starting from EMR 5.32 and EMR 6.2, you may notice that Spark can launch much larger executors than you request in your job settings. For example, EMR created my cluster with the following default settings (they depend on the instance type and the maximizeResourceAllocation classification option):
```
  spark.executor.memory                      18971M
  spark.executor.cores                       4
  spark.yarn.executor.memoryOverheadFactor   0.1875
```
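
For reference, the container size implied by these defaults can be estimated with a little arithmetic. This is a minimal sketch that assumes the overhead is computed as the executor memory multiplied by the overhead factor, with a 384 MB floor as in open-source Spark on YARN; the function name is hypothetical, and YARN may additionally round the request up to its allocation increment.

```python
MIN_OVERHEAD_MB = 384  # assumed floor, as in open-source Spark on YARN


def container_memory_mb(executor_memory_mb, overhead_factor):
    """Estimate the memory requested per executor container, in MB."""
    overhead_mb = max(int(executor_memory_mb * overhead_factor), MIN_OVERHEAD_MB)
    return executor_memory_mb + overhead_mb


# Values taken from the default settings listed above
container_mb = container_memory_mb(18971, 0.1875)
print(container_mb)         # 22528
print(container_mb / 1024)  # 22.0
```
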
But when I start a Spark session (pyspark command) I see the following:
Read More

dmtolpeko

Amazon EMR Spark – Ignoring Partition Filter and Listing All Partitions When Reading from S3A

April 20, 2022

I have a partitioned Hive table created by an open-source version of Hadoop that uses the S3A scheme as the location for every partition. The table has more than 10,000 partitions, and every partition has about 8,000 Parquet files:

```
$ hive -e "show partitions events";
...
dateint=20220419/hour=11
dateint=20220419/hour=12
dateint=20220419/hour=13

$ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=11

$ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=12

$ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=13
```

The s3a:// scheme is specified for every partition in this table.

Reading a Partition in Amazon EMR Spark

When I tried to read data from a single partition using Spark SQL:

```
$ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"
```

The Spark driver failed with:

```
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 4847"...
```

AWS, Flink, S3

Flink and S3 Entropy Injection for Checkpoints

January 2, 2021

When you use S3 for storing checkpoints, it can easily become a bottleneck, especially for a Flink application with a lot of subtasks. To overcome this problem, FLINK-9061 introduced entropy injection into the checkpoint path.

But the Flink documentation provides a misleading example (at least up to Flink 1.13) that actually destroys the value of the checkpoint entropy.
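
As a rough illustration of the entropy injection idea itself (not of the documentation example mentioned above), the sketch below replaces a marker in a checkpoint path with random characters so that writes spread across different S3 key prefixes. It is not Flink's implementation: the function name and the bucket name are hypothetical, and the marker `_entropy_` and the length 4 stand in for whatever is configured through Flink's s3.entropy.key and s3.entropy.length options.

```python
import random
import string


def inject_entropy(path, key="_entropy_", length=4, rng=random):
    """Replace the entropy marker in a path with random alphanumeric characters."""
    alphabet = string.ascii_lowercase + string.digits
    entropy = "".join(rng.choice(alphabet) for _ in range(length))
    return path.replace(key, entropy)


# Hypothetical checkpoint path template containing the marker
template = "s3://my-bucket/_entropy_/checkpoints"
for _ in range(3):
    # Each call yields a different random prefix segment
    print(inject_entropy(template))
```

The closer the marker sits to the start of the key, the more the random segment changes the key prefix.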

Read More

dmtolpeko
AWS, I/O, S3, Storage

S3 Low Latency Writes – Using Aggressive Retries to Get Consistent Latency – Request Timeouts

June 4, 2020

Amazon S3 is a highly scalable distributed system that can handle extremely large volumes of data, adapt to an increasing workload and provide quite good performance as file storage.

But sometimes you have to tweak it to run faster, which can be especially important for latency-sensitive applications.

Read More

dmtolpeko
AWS, S3

S3 Multipart Upload – 5 MB Part Size Limit

May 27, 2020

It is a well-known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB, with the exception that the last part can be smaller than 5 MB.

Does it mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?
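
The rule as stated can be written down as a small check. This is only a sketch of the size constraint, not an S3 client: the function name is hypothetical, 1 MB is assumed to mean 2^20 bytes, and parts are assumed to be non-empty. Under this reading, a single small part is also the last part, so the lower bound does not apply to it.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB         # lower bound for every part except the last
MAX_PART = 5 * 1024 * MIB  # upper bound for every part


def parts_are_valid(part_sizes):
    """Check a list of part sizes (in bytes) against the multipart size rule."""
    if not part_sizes:
        return False
    *head, last = part_sizes
    # All parts but the last must fall within [5 MB, 5 GB]
    head_ok = all(MIN_PART <= size <= MAX_PART for size in head)
    # The last part only has to respect the upper bound
    return head_ok and 0 < last <= MAX_PART


print(parts_are_valid([1 * MIB]))           # single small part
print(parts_are_valid([5 * MIB, 1 * MIB]))  # small part at the end
print(parts_are_valid([1 * MIB, 6 * MIB]))  # small part in the middle
```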

Read More

dmtolpeko
AWS, Flink, S3

Flink S3 Checkpoints – Monitoring Using S3 Access Logs

May 26, 2020

You can use the Flink Web UI to monitor the checkpoint operations in Flink, but in some cases S3 access logs can provide more information, and can be especially useful if you run many Flink applications.

Read More

dmtolpeko
AWS, Hive, S3

Hive Table for S3 Access Logs

May 26, 2020

Although Amazon S3 can generate a lot of logs and it makes sense to have an ETL process to parse, combine and put the logs into Parquet or ORC format for better query performance, there is still an easy way to analyze logs using a Hive table created just on top of the raw S3 log directory.

Read More

dmtolpeko
AWS, Kinesis

Kinesis Client Library (KCL 2.x) Consumer – Load Balancing, Rebalancing – Taking, Renewing and Stealing Leases

May 20, 2020

For zero-downtime, large-scale systems you can have multiple compute clusters located in different availability zones.

The Kinesis KCL 2.x Consumer is very helpful for building highly scalable, elastic and fault-tolerant streaming data processing pipelines for Amazon Kinesis. Let’s review some of the KCL internals related to load balancing and the response to compute node/cluster failures, and how you can tune and monitor such activities.

Read More

dmtolpeko
AWS, CPU, EC2, EMR, Hadoop, Qubole, YARN

AWS EC2 vCPU and YARN vCores – M4, C4, R4 Instances

May 7, 2020

Let’s review how EC2 vCPUs correspond to YARN vCores in Amazon EMR and Qubole Hadoop clusters. As an example, I will choose m4.4xlarge, r4.4xlarge and c4.4xlarge EC2 instance types.

An EC2 vCPU is a thread of a CPU core (typically, there are two threads per core). Does it mean that the number of YARN vCores should be equal to the number of EC2 vCPUs? That’s not always the case.

Read More

dmtolpeko