• Amazon,  AWS,  EMR,  S3,  Spark

    Amazon EMR Spark – Ignoring Partition Filter and Listing All Partitions When Reading from S3A

    I have a partitioned Hive table created by an open-source version of Hadoop that uses S3A scheme as the location for every partition. The table has more than 10,000 partitions and every partition has about 8,000 Parquet files:

    $ hive -e "show partitions events";
    $ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=11
    $ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=12
    $ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=13

    S3A:// is specified for every partition in this table.

    Reading a Partition in Amazon EMR Spark

    When I made an attempt to read data from a single partition using Spark SQL:

    $ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"

    The Spark driver failed with:

    # java.lang.OutOfMemoryError: GC overhead limit exceeded
    # -XX:OnOutOfMemoryError="kill -9 %p"
    #   Executing /bin/sh -c "kill -9 4847"...
  • AWS,  Flink,  S3

    Flink and S3 Entropy Injection for Checkpoints

    When you use S3 for storing checkpoints it can easily become a bottleneck especially for your Flink application with a lot of subtasks. To overcome this problem FLINK-9061 introduced an entropy ingestion to the checkpoint path.

    But the Flink documentation provides a misleading example (at least up to Flink 1.13) that actually destroys the value of the checkpoint entropy.

  • AWS,  S3

    S3 Multipart Upload – 5 MB Part Size Limit

    It is a well known limitation that Amazon S3 multipart upload requires the part size to be between 5 MB and 5 GB with an exception that the last part can be less than 5 MB.

    Does it mean that you cannot upload a single small file (< 5 MB) to S3 using the multipart upload?

  • AWS,  Hive,  S3

    Hive Table for S3 Access Logs

    Although Amazon S3 can generate a lot of logs and it makes sense to have an ETL process to parse, combine and put the logs into Parquet or ORC format for better query performance, there is still an easy way to analyze logs using a Hive table created just on top of the raw S3 log directory.

  • AWS,  Kinesis

    Kinesis Client Library (KCL 2.x) Consumer – Load Balancing, Rebalancing – Taking, Renewing and Stealing Leases

    For zero-downtime, large-scale systems you can have multiple compute clusters located in different availability zones.

    The Kinesis KCL 2.x Consumer is very helpful to build highly scalable, elastic and fault-tolerant streaming data processing pipelines for Amazon Kinesis. Let’s review some of the KCL internals related to the load balancing and response to compute node/cluster failures and how you can tune and monitor such activities.

  • AWS,  CPU,  EC2,  EMR,  Hadoop,  Qubole,  YARN

    AWS EC2 vCPU and YARN vCores – M4, C4, R4 Instances

    Let’s review how EC2 vCPUs correspond to YARN vCores in Amazon EMR and Qubole Hadoop clusters. As an example, I will choose m4.4xlarge, r4.4xlarge and c4.4xlarge EC2 instance types.

    EC2 vCPU is a thread of a CPU core (typically, there are two threads per core). Does it mean that YARN vCores should be equal to the number of EC2 vCPU? That’s not always the case.

  • AWS,  I/O,  S3

    S3 Multipart Upload – S3 Access Log Messages

    Most applications writing data into S3 use the S3 multipart upload API to upload data in parts. First, you initiate the load, then upload parts and finally complete the multipart upload.

    Let’s see how this operation is reflected in the S3 access log. My application uploaded the file data.gz into S3, and I can view it as follows: