Every Parquet file has a footer that contains metadata: the schema, row groups and column statistics. The footer is located at the end of the file.
The content of a Parquet file starts and ends with the 4-byte PAR1 “magic” string. Right before the trailing PAR1 there is a 4-byte footer length, stored in little-endian encoding.
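The position of the footer (its starting offset in the file) can be easily calculated as:
File_length - Footer_length - 8
where 8 accounts for the 4-byte footer length and the 4-byte trailing PAR1 that follow the footer.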
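The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a Parquet reader: the function name locate_footer is hypothetical, and the sample bytes are a synthetic stand-in rather than real Thrift-encoded metadata. Because the last 8 bytes of the file are the 4-byte footer length followed by the 4-byte PAR1, the footer starts at file length minus footer length minus 8.

```python
import struct

MAGIC = b"PAR1"

def locate_footer(data: bytes) -> tuple:
    """Return (footer_offset, footer_length) for the raw bytes of a Parquet file."""
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: PAR1 magic is missing")
    # The 4 bytes right before the trailing magic hold the footer length
    # as an unsigned little-endian integer.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_offset = len(data) - footer_length - 8
    if footer_offset < 4:
        raise ValueError("corrupt footer length")
    return footer_offset, footer_length

# Synthetic file for illustration only (the "footer" is not real Thrift data).
fake_footer = b"thrift-encoded-metadata"
fake_file = (MAGIC + b"column-chunk-bytes" + fake_footer
             + struct.pack("<I", len(fake_footer)) + MAGIC)
offset, length = locate_footer(fake_file)
assert fake_file[offset:offset + length] == fake_footer
```

For a real file on disk, the same idea applies without loading everything into memory: seek to 8 bytes before the end, read the length and the magic, then seek back by the footer length.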
The footer itself is encoded using the Thrift protocol and contains the file metadata and the metadata of the blocks (row groups).
Note that Parquet 1.x file metadata does not include information about the number of rows and the total size. You have to iterate over the metadata for all blocks (row groups) in the footer and sum their row counts and sizes to get the total number of rows and the data size of the Parquet file.
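As a rough sketch of that aggregation, assume the footer has already been decoded into a list of per-block records. The BlockMeta class and its field names below are hypothetical stand-ins for illustration, not the API of any Parquet library.

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    """Hypothetical stand-in for one block (row group) entry in the footer."""
    row_count: int
    total_byte_size: int

def file_totals(blocks):
    """Sum the per-block row counts and sizes to get file-level totals."""
    total_rows = sum(b.row_count for b in blocks)
    total_bytes = sum(b.total_byte_size for b in blocks)
    return total_rows, total_bytes

# Made-up numbers for three blocks, for illustration only.
blocks = [BlockMeta(1000, 65536), BlockMeta(1000, 70000), BlockMeta(250, 18000)]
rows, size = file_totals(blocks)
print(rows, size)  # prints: 2250 153536
```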
Also note that min/max column statistics are available per block (row group); there are no min/max column statistics per file. So to skip reading an entire Parquet file based on a pushdown predicate (a query filter), compute engines have to iterate over the metadata for all blocks in the footer and check the min/max statistics of every block to see whether that block can be skipped.
Only if all blocks can be skipped can reading of the entire Parquet file be skipped.
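The skipping logic can be illustrated with a small sketch for a range filter of the form lo <= column <= hi on a single integer column. The BlockStats class and the function names are hypothetical, and a block without statistics is conservatively treated as one that must be read, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlockStats:
    """Hypothetical min/max statistics of one column in one block (row group)."""
    min_value: Optional[int]
    max_value: Optional[int]

def block_may_match(stats, lo, hi):
    """True if the block could contain a value v with lo <= v <= hi."""
    if stats.min_value is None or stats.max_value is None:
        return True  # no statistics: the block cannot be ruled out
    # The block's [min, max] range overlaps the filter range [lo, hi].
    return stats.max_value >= lo and stats.min_value <= hi

def can_skip_file(blocks, lo, hi):
    """The file can be skipped only if every block can be skipped."""
    return not any(block_may_match(b, lo, hi) for b in blocks)

# Made-up statistics for three blocks, for illustration only.
blocks = [BlockStats(1, 100), BlockStats(101, 200), BlockStats(150, 300)]
assert can_skip_file(blocks, 400, 500)      # no block overlaps [400, 500]
assert not can_skip_file(blocks, 90, 120)   # the first two blocks overlap
```

Even when the file as a whole cannot be skipped, the same per-block check tells the reader which individual blocks to read and which to pass over.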