I have a partitioned Hive table, created on an open-source Hadoop cluster, that uses the s3a:// scheme in the location of every partition. The table has more than 10,000 partitions, and every partition contains about 8,000 Parquet files:
$ hive -e "show partitions events";
...
dateint=20220419/hour=11
dateint=20220419/hour=12
dateint=20220419/hour=13
$ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location
Location: s3a://cloudsqale/events/dateint=20220419/hour=11
$ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location
Location: s3a://cloudsqale/events/dateint=20220419/hour=12
$ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location
Location: s3a://cloudsqale/events/dateint=20220419/hour=13
Note that the s3a:// scheme is specified in the location of every partition in this table.
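For context, a table laid out like this could have been declared with HiveQL along the following lines. This is only a sketch: the data columns are hypothetical, and just the partition keys, storage format and location scheme are taken from the output above.

CREATE EXTERNAL TABLE events (
  -- data columns are hypothetical, for illustration only
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (dateint INT, hour STRING)
STORED AS PARQUET
LOCATION 's3a://cloudsqale/events';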
Reading a Partition in Amazon EMR Spark
When I attempted to read data from a single partition using Spark SQL:
$ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"
The Spark driver failed with:
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 4847"...
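java.lang.OutOfMemoryError: GC overhead limit exceeded means the JVM is spending almost all of its time in garbage collection while reclaiming very little heap, and the -XX:OnOutOfMemoryError hook shown above then kills the driver process. An obvious first knob to try, although it may or may not help depending on what is actually filling the driver heap, is a larger driver heap (the 8g value below is arbitrary, for illustration only):

$ spark-sql --master yarn --driver-memory 8g -e "select count(*) from events where dateint=20220419 and hour='11'"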