Hacker News new | ask | show | jobs
by xoofoog 409 days ago
Wow - that's super clever. How do you get away with loading part of the file? Which part do you load?
1 comments

I’m not OP but as this is a common pattern…

Parquet stores the metadata in the footer so first request is effectively a negative byte range (content length minus footer length). This metadata includes table statistics like “column ‘date_sold’ has minimum date 1-1-1970 and maximum date 12-31-2024,” and row group statistics like “the row group at byte offset X has minimum ‘date_sold’ value of 1-1-2023 and maximum ‘1-1-2024’.”

So if your query tool gets a SQL query with a predicate like “WHERE date_sold > ‘3-1-2024’ AND date_sold < ‘3-30-2024’” then it can use “partition pruning” to fetch only the RowGroup of the parquet file that includes the March 2024 data.

My colleague Artjoms (and co-founder of Splitgraph with me) gave a great presentation [0] on how we achieved this with DataFusion, including visualization of the pruning.

[0] https://youtube.com/watch?v=D_phetiS-4w