| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by FridgeSeal 720 days ago
	Parquet is designed with predicate push-down in mind. Partitions are laid out on disk, and then blocks within files are laid out so that consumers can very, very easily narrow in on which files they need to read, before doing anymore IO than a list, or a small metadata read. Once you know what you are reading, many parquet/arrow libraries will support streaming reads/aggregations, so the client doesn’t need to load the whole working set in memory.

1 comments

This only covers very simple min / max / sum cases.

For all others you'll need to download all columns you are filtering or selecting.