by MSM 2011 days ago
EDIT: After looking into it, it seems Spark calls both things predicate pushdown (eliminating unnecessary row group reads via the statistics AND pushing the predicates down to the lowest possible level). You're right, I'm wrong!

>Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.

A nitpick, but I wouldn't call this predicate pushdown; it's partition (or segment) elimination. Pushing a predicate down can allow files to be skipped through this process, though.

1 comment

It's min/max per row group, so (potentially) huge chunks of the Parquet file don't need to be read from disk if only a subset of the row groups can match the predicate.
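
To make the row-group skipping concrete, here is a rough sketch in plain Python. It does not use any real Parquet or Spark API: `make_row_groups` and `scan_greater_than` are hypothetical helpers, and each "row group" is just a dict holding a chunk of values plus its min/max. It only illustrates the idea of checking the statistics before reading the rows.

```python
# Toy model of min/max row-group skipping. Not a real Parquet reader:
# a "row group" here is a dict with a chunk of values and its min/max.

def make_row_groups(values, size):
    """Split a column into fixed-size chunks and record min/max per chunk."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, reading only groups whose max allows a match."""
    matches = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            # Statistics prove no row in this group can qualify: skip it.
            continue
        groups_read += 1
        matches.extend(v for v in group["rows"] if v > threshold)
    return matches, groups_read


# 100 sorted values in 10 row groups of 10 rows each (assumed layout).
groups = make_row_groups(list(range(100)), 10)
matches, groups_read = scan_greater_than(groups, 84)
print(f"read {groups_read} of {len(groups)} row groups, {len(matches)} matching rows")
```

The data is sorted here, which is the best case. If the column is not sorted or clustered, the min/max ranges of the row groups overlap and far fewer of them can be skipped.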