April 2022 – Large-Scale Data Engineering in Cloud

I have a partitioned Hive table, created by an open-source version of Hadoop, that uses the S3A scheme in the location of every partition. The table has more than 10,000 partitions, and every partition contains about 8,000 Parquet files:
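For a sense of scale, the rounded figures above imply roughly 80 million Parquet files in the whole table. A back-of-the-envelope check in the shell, assuming exactly 10,000 partitions and 8,000 files per partition (the real counts are only approximate, and the variable names are illustrative):

```shell
# Approximate figures quoted in the text, not exact counts
partitions=10000
files_per_partition=8000

# Rough total number of Parquet files in the table
total_files=$((partitions * files_per_partition))
echo "$total_files"
```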

$ hive -e "show partitions events";
...
dateint=20220419/hour=11
dateint=20220419/hour=12
dateint=20220419/hour=13

$ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=11

$ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=12

$ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
Location:   s3a://cloudsqale/events/dateint=20220419/hour=13

The s3a:// scheme is specified for every partition in this table.
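The three checks above can be wrapped in a small loop. This is only a sketch: `partition_spec` and `print_partition_locations` are hypothetical helper names, and it assumes the partition layout shown above (`dateint=.../hour=...`, with `hour` quoted as a string in the partition spec). It reuses the same `hive -e "describe formatted ..."` command and nothing else.

```shell
# Convert a partition path such as dateint=20220419/hour=11
# into a Hive partition spec: dateint=20220419, hour='11'
# (assumes the dateint/hour layout of the events table)
partition_spec() {
  echo "$1" | sed -e "s/hour=\(.*\)/hour='\1'/" -e "s|/|, |"
}

# Print the Location line for each partition path passed as an argument.
# Every call starts a new Hive session, so this suits spot checks,
# not a scan of all partitions.
print_partition_locations() {
  table="$1"; shift
  for part in "$@"; do
    hive -e "describe formatted $table partition ($(partition_spec "$part"))" | grep Location
  done
}
```

For example, `print_partition_locations events dateint=20220419/hour=11 dateint=20220419/hour=12` would run the first two checks shown above.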

Reading a Partition in Amazon EMR Spark

I tried to read data from a single partition using Spark SQL:

$ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"

The Spark driver failed with:

# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 4847"...