by davesque 751 days ago
If the Parquet file includes any row group stats, then I imagine DuckDB might be able to use those to avoid scanning the entire file. It's definitely possible to request specific byte ranges of a blob stored in S3. But I'm not familiar enough with DuckDB to know whether or not it does this.
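To make the idea concrete, here is a rough pure-Python sketch of skipping row groups by their min/max stats and turning the survivors into ranged reads. This is not DuckDB's implementation: `RowGroupStats`, `groups_to_read`, `byte_ranges`, and the offsets are all made up for illustration, and a real reader would first fetch the Parquet footer to get these numbers.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group footer metadata for a single column."""
    offset: int     # byte offset of the row group within the file
    length: int     # byte length of the row group
    min_value: int  # smallest column value in the group
    max_value: int  # largest column value in the group


def groups_to_read(stats, lo, hi):
    """Keep row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return [g for g in stats if g.max_value >= lo and g.min_value <= hi]


def byte_ranges(groups):
    """Build HTTP Range header values (end offset is inclusive) for each group."""
    return [f"bytes={g.offset}-{g.offset + g.length - 1}" for g in groups]


# Illustrative footer: three row groups covering disjoint value ranges.
footer = [
    RowGroupStats(offset=4, length=1000, min_value=0, max_value=99),
    RowGroupStats(offset=1004, length=1000, min_value=100, max_value=199),
    RowGroupStats(offset=2004, length=1000, min_value=200, max_value=299),
]

# Filter: WHERE x BETWEEN 150 AND 180 only needs the middle row group.
selected = groups_to_read(footer, lo=150, hi=180)
print(byte_ranges(selected))
```

With a filter that matches only one row group, only that group's byte range needs to be requested from S3 instead of the whole object.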
2 comments

DuckDB can do some pushdowns for certain file formats like Parquet, and every release seems to get better at it.

Parquet pushdowns plus Hive-style partitioning make a pretty good combination.
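As a rough illustration of why the Hive layout helps, here is a pure-Python sketch of partition pruning: the filter is answered from the directory names alone, so non-matching files are never opened. The helper names (`partition_values`, `prune`) and the file paths are hypothetical, and this is only the general idea, not how DuckDB implements it.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Parse Hive-style key=value directory names out of a file path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key=value."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Illustrative layout: one Parquet file per year/month partition.
files = [
    "events/year=2023/month=01/part-0.parquet",
    "events/year=2023/month=02/part-0.parquet",
    "events/year=2024/month=01/part-0.parquet",
]

# Filter: WHERE year = '2023' AND month = '02' touches a single file.
print(prune(files, year="2023", month="02"))
```

Row group stats can then narrow things further inside whichever files survive the partition filter.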

There are some HTTP and metadata caching options in DuckDB, but I haven't really figured out how and when they make a difference.

It does do that. I can't answer OP's question about caching though.