I have a partitioned Hive table that was created by an open-source Hadoop distribution and uses the s3a:// scheme in the location of every partition. The table has more than 10,000 partitions, and every partition contains about 8,000 Parquet files:
$ hive -e "show partitions events"; ... dateint=20220419/hour=11 dateint=20220419/hour=12 dateint=20220419/hour=13 $ hive -e "describe formatted events partition (dateint=20220419, hour='11')" | grep Location Location: s3a://cloudsqale/events/dateint=20220419/hour=11 $ hive -e "describe formatted events partition (dateint=20220419, hour='12')" | grep Location Location: s3a://cloudsqale/events/dateint=20220419/hour=12 $ hive -e "describe formatted events partition (dateint=20220419, hour='13')" | grep Location Location: s3a://cloudsqale/events/dateint=20220419/hour=13
Note that the s3a:// scheme is specified for every partition in this table.
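For context, partition locations like these usually trace back to the table's DDL: when an external table is declared with an s3a:// root location, partitions added under it inherit the same scheme. A minimal sketch of what such a DDL might look like (the columns event_id and payload are assumptions, not the actual schema):

$ hive -e "
    create external table events (
      event_id  string,   -- assumed column
      payload   string    -- assumed column
    )
    partitioned by (dateint int, hour string)
    stored as parquet
    location 's3a://cloudsqale/events'"

$ hive -e "
    alter table events add partition (dateint=20220419, hour='11')
    location 's3a://cloudsqale/events/dateint=20220419/hour=11'"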
Reading a Partition in Amazon EMR Spark
When I attempted to read data from a single partition using Spark SQL:
$ spark-sql --master yarn -e "select count(*) from events where dateint=20220419 and hour='11'"
The Spark driver failed with:
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 4847"...
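The GC overhead limit exceeded error means the driver JVM was spending nearly all of its time in garbage collection before the -XX:OnOutOfMemoryError hook killed the process. Since the query filters on a single partition, an immediate question is whether the driver is really dealing with just that partition's ~8,000 files, or with far more: a full listing of 10,000+ partitions at about 8,000 files each would cover on the order of 80 million objects. One way to compare the two scales is hadoop fs -count, which reports the directory, file, and byte counts under a path; this is just a sketch, and the second command can itself take a long time on a table of this size:

$ hadoop fs -count s3a://cloudsqale/events/dateint=20220419/hour=11
$ hadoop fs -count s3a://cloudsqale/events/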