• Amazon, AWS, EMR, S3, Spark

    Amazon EMR Spark – Ignoring Partition Filter and Listing All Partitions When Reading from S3A

    I have a partitioned Hive table created by an open-source distribution of Hadoop that uses the s3a:// scheme as the location for every partition. The table has more than 10,000 partitions, and every partition contains about 8,000 Parquet files:

    $ hive -e "show partitions events"
    ...
    dateint=20220419/hour=11
    dateint=20220419/hour=12
    dateint=20220419/hour=13
    
    $ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=11
    
    $ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=12
    
    $ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
    Location:   s3a://cloudsqale/events/dateint=20220419/hour=13
    

    Note that the s3a:// scheme is specified in the location of every partition of this table.
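
    To get a sense of the scale, you can count the partitions in the metastore and the Parquet files in a single partition. This is just a quick sanity check; the paths follow the listing above, and AWS CLI access to the bucket is assumed (the AWS CLI always takes the s3:// form, regardless of the scheme stored in the metastore):

    $ # Number of partitions registered in the metastore
    $ hive -e "show partitions events" | wc -l

    $ # Number of Parquet files in one partition
    $ aws s3 ls s3://cloudsqale/events/dateint=20220419/hour=11/ | wc -l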

    Reading a Partition in Amazon EMR Spark

    When I attempted to read data from a single partition using Spark SQL:

    $ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"
    

    The Spark driver failed with:

    # java.lang.OutOfMemoryError: GC overhead limit exceeded
    # -XX:OnOutOfMemoryError="kill -9 %p"
    #   Executing /bin/sh -c "kill -9 4847"...
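
    Before looking at why a single-partition query exhausts the driver heap, it is useful to see what the driver is actually busy with. A thread dump taken while the query appears to hang can show whether the time is spent listing S3 objects. A minimal sketch, assuming you run it on the node hosting the driver and that spark-sql was launched through the standard SparkSQLCLIDriver entry point:

    $ # Find the spark-sql driver process and capture a thread dump
    $ DRIVER_PID=$(pgrep -f SparkSQLCLIDriver)
    $ jstack $DRIVER_PID > /tmp/driver_threads.txt

    $ # Look for S3 file-listing activity in the dump
    $ grep -i -A 10 "S3AFileSystem\|listStatus" /tmp/driver_threads.txt

    Simply raising --driver-memory may postpone the error, but as the title suggests, the real problem is that the partition filter is ignored and the driver tries to list every partition of the table.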