

by MSM 2011 days ago

EDIT: After looking into it, it seems like Spark calls both things predicate pushdowns (eliminating unnecessary row group reads via the statistics AND pushing the predicates down to the lowest possible level). You're right, I'm wrong!

>Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.

A nitpick, but I wouldn't call this predicate pushdown; it's partition (or segment) elimination. That said, pushing a predicate down can allow files to be skipped through this process.

1 comment

tomnipotent 2011 days ago

It's min/max per row group, so (potentially) huge chunks of the Parquet file don't need to be read from disk if only a subset of row groups qualify.
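
To make the row-group point concrete, here is a rough sketch in plain Python. It does not use any real Parquet library: `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names invented for illustration, and the min/max fields simply stand in for the statistics Parquet keeps per row group. The idea is only that a reader can check the stored min/max against the predicate and skip a whole row group without touching its values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, not a real API)."""
    min_value: int      # statistic stored as metadata
    max_value: int      # statistic stored as metadata
    values: List[int]   # the actual column data, expensive to read


def make_row_group(values: List[int]) -> RowGroup:
    # Statistics are computed once at write time and stored next to the data.
    return RowGroup(min(values), max(values), values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and how many row groups were actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        # If even the largest value fails the predicate, no row in this
        # group can match, so the group is skipped without reading values.
        if rg.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([1, 5, 9]),
    make_row_group([10, 14, 19]),
    make_row_group([20, 25, 31]),
]
matches, groups_read = scan_greater_than(row_groups, 18)
print(matches, groups_read)
```

With the threshold at 18, the first row group is skipped on its max alone, and only the other two are read. Note that the middle group still has to be read in full even though just one of its values matches, which is why skipping works best on data that is sorted or clustered on the filtered column.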
