A Parquet file contains MIN/MAX statistics for every column in every row group, which allows Spark applications to skip reading unnecessary data chunks depending on the query predicate. Let's see how this works with LIKE pattern-matching filters.
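If you want to inspect these statistics yourself, one way is to read the Parquet footer, for example with PyArrow. A minimal sketch, assuming the file is named events.parquet and contains a product column (both names are my assumptions, not part of the original setup):

import pyarrow.parquet as pq

# Open the Parquet file and walk its footer metadata (file name is hypothetical)
meta = pq.ParquetFile("events.parquet").metadata

for rg in range(meta.num_row_groups):
    for c in range(meta.num_columns):
        col = meta.row_group(rg).column(c)
        if col.path_in_schema == "product":
            stats = col.statistics
            print(f"Row Group {rg + 1}: MIN: {stats.min}, MAX: {stats.max}")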
For my tests I will use a Parquet file with 4 row groups and the following MIN/MAX statistics for the product column:

Row Group 1: MIN: Apple, MAX: Bananas
Row Group 2: MIN: Grape, MAX: Pineapple
Row Group 3: MIN: Lemon, MAX: Peach
Row Group 4: MIN: Raspberry, MAX: Xigua

I have created an events table on top of this Parquet file, so I can run the following Spark SQL query as an example:
SELECT name, location FROM events WHERE product = 'Orange';
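As a rough sketch of the setup in PySpark (the file path and SparkSession creation are my assumptions; the article only states that an events table exists over the Parquet file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-like-pushdown").getOrCreate()

# Register the Parquet file as a temporary view named events (path is hypothetical)
spark.read.parquet("/data/events.parquet").createOrReplaceTempView("events")

# Run the equality query; only row groups whose MIN/MAX range covers 'Orange' need to be read
spark.sql("SELECT name, location FROM events WHERE product = 'Orange'").show()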
According to the MIN/MAX statistics that exist for every row group, Spark needs to read only Row Group 2 (MIN: Grape, MAX: Pineapple) and Row Group 3 (MIN: Lemon, MAX: Peach) to look for rows that satisfy the product = 'Orange' condition.
It is not required to read Row Group 1 (MIN: Apple, MAX: Bananas) and Row Group 4 (MIN: Raspberry, MAX: Xigua) as the value Orange is outside of their MIN/MAX ranges.
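To see why only Row Groups 2 and 3 survive, here is a small Python illustration of the pruning rule for an equality predicate, using the MIN/MAX values listed above (the dictionary below simply restates those statistics; it is not read from the file):

# MIN/MAX statistics per row group, restated from the listing above
row_groups = {
    1: ("Apple", "Bananas"),
    2: ("Grape", "Pineapple"),
    3: ("Lemon", "Peach"),
    4: ("Raspberry", "Xigua"),
}

value = "Orange"

for rg, (mn, mx) in row_groups.items():
    # A row group can contain the value only if MIN <= value <= MAX
    if mn <= value <= mx:
        print(f"Row Group {rg}: must be read")
    else:
        print(f"Row Group {rg}: skipped")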
If you look at the Spark EXPLAIN EXTENDED output for this query, you can see the EqualTo filter in PushedFilters:
== Physical Plan ==
*(1) Project [product#22]
+- *(1) Filter (isnotnull(product#22) && (product#22 = Orange))
   +- *(1) FileScan parquet events... PushedFilters: [IsNotNull(product), EqualTo(product,Orange)]
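You can reproduce this output from PySpark, for example (assuming the SparkSession and events view from the earlier sketch):

# Print the extended plan, including PushedFilters, for the equality query
spark.sql("SELECT name, location FROM events WHERE product = 'Orange'").explain(True)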
LIKE ‘%value%’
What about the product LIKE '%Orange%' condition (note the % before and after the search string):
SELECT name, location FROM events WHERE product LIKE '%Orange%';
Now the physical plan is as follows (note the StringContains filter in PushedFilters):
== Physical Plan ==
*(1) Project [product#22]
+- *(1) Filter (isnotnull(product#22) && Contains(product#22, Orange))
   +- *(1) FileScan parquet events... PushedFilters: [IsNotNull(product), StringContains(product,Orange)]
But although Spark still pushes down the StringContains filter, Spark (and any other engine) has to read all row groups, as the MIN/MAX statistics cannot help eliminate data chunks based on a search value that may appear anywhere inside the string.
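To illustrate why MIN/MAX statistics cannot prune a contains predicate: a string that contains 'Orange' can sort anywhere lexicographically, so every row group's range may hold a match. A small Python sketch (the sample values below are made up for illustration, and the statistics restate the listing above):

# MIN/MAX statistics per row group, restated from the listing above
row_groups = {
    1: ("Apple", "Bananas"),
    2: ("Grape", "Pineapple"),
    3: ("Lemon", "Peach"),
    4: ("Raspberry", "Xigua"),
}

# Hypothetical values that contain 'Orange' yet fall inside each range
samples = {1: "Apple Orange", 2: "Mandarin Orange", 3: "Navel Orange", 4: "Sweet Orange"}

for rg, (mn, mx) in row_groups.items():
    s = samples[rg]
    print(f"Row Group {rg}: {mn} <= {s} <= {mx} -> {mn <= s <= mx}")  # True for all four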
You can check the Input Size and Records statistics after executing the query to see the increased I/O.
LIKE ‘value%’
And finally let’s consider the product LIKE 'Orange%' condition (now the % appears after the search string only):
SELECT name, location FROM events WHERE product LIKE 'Orange%';
The physical plan is as follows (note the StringStartsWith filter in PushedFilters):
== Physical Plan ==
*(1) Project [product#22]
+- *(1) Filter (isnotnull(product#22) && StartsWith(product#22, Orange))
   +- *(1) FileScan parquet events... PushedFilters: [IsNotNull(product), StringStartsWith(product,Orange)]
Unlike the product LIKE '%Orange%' condition, product LIKE 'Orange%' allows Spark to utilize the MIN/MAX statistics and skip reading Row Groups 1 and 4, similar to the product = 'Orange' condition.
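A prefix predicate can be checked against MIN/MAX by comparing only the first len(prefix) characters: truncating two strings to the same length preserves their lexicographic order, so any value in [MIN, MAX] that starts with the prefix forces MIN[:k] <= prefix <= MAX[:k]. A small Python sketch of that idea, using the statistics listed above (this illustrates the principle, not Spark's exact implementation):

# MIN/MAX statistics per row group, restated from the listing above
row_groups = {
    1: ("Apple", "Bananas"),
    2: ("Grape", "Pineapple"),
    3: ("Lemon", "Peach"),
    4: ("Raspberry", "Xigua"),
}

prefix = "Orange"
k = len(prefix)

for rg, (mn, mx) in row_groups.items():
    # A row group can contain a string starting with the prefix only if
    # MIN[:k] <= prefix <= MAX[:k]
    if mn[:k] <= prefix <= mx[:k]:
        print(f"Row Group {rg}: must be read")
    else:
        print(f"Row Group {rg}: skipped")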
You can prove that by checking the Input Size and Records statistics after executing the query.