| HN Mirror



	by cmollis 996 days ago
	This part of the docs is confusing to me. I assume you're using the httpfs (S3) extension and scanning the parquet files, which I think is actually streamed (e.g. querying for specific column values across a series of parquet files). We have a huge data set of hive-partitioned parquet files in S3 (e.g. /customerid/year/month/<series of parquet files>). Can I just scan these files using a glob pattern to retrieve data, like I can with Athena? The extension docs seem to indicate that I can (from the docs: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;). Or do I need to know which parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena.


wenc 996 days ago

No, you can definitely use globs in DuckDB.

And no, you don’t have to know the exact parquet file. You treat the Hive-partitioned data as a single dataset and DuckDB scans it automatically. (Partition elimination, predicate pushdown, etc. are all done automatically.)

https://duckdb.org/docs/data/partitioning/hive_partitioning
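
To make "partition elimination" concrete, here is a toy sketch in plain Python of the idea: read the partition values out of the directory names and skip files that cannot match the filter before opening any of them. This is an illustration only, not DuckDB's implementation. The bucket, file names, and the helpers `parse_hive_partitions` and `prune` are hypothetical. It also assumes directories are named in the `key=value` style (e.g. `year=2013`), which is what Hive partitioning relies on to infer partition columns; a bare `/customerid/year/month/` layout would not carry the column names.

```python
from fnmatch import fnmatchcase


def parse_hive_partitions(path):
    """Return the key=value directory segments of a path as a dict of strings."""
    parts = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, pattern, **predicates):
    """Keep paths that match the glob and satisfy every equality predicate.

    Only the path strings are inspected, so files in non-matching
    partitions are never opened.
    """
    kept = []
    for path in paths:
        if not fnmatchcase(path, pattern):
            continue
        parts = parse_hive_partitions(path)
        if all(parts.get(key) == str(value) for key, value in predicates.items()):
            kept.append(path)
    return kept


# Hypothetical object listing, using key=value directory names.
files = [
    "s3://bucket/customerid=42/year=2013/month=01/part-0.parquet",
    "s3://bucket/customerid=42/year=2014/month=01/part-0.parquet",
    "s3://bucket/customerid=7/year=2013/month=02/part-0.parquet",
]

# Analogous to "... WHERE year = 2013": only two of the three files survive.
selected = prune(files, "s3://bucket/*/*/*/*.parquet", year=2013)
for path in selected:
    print(path)
```

In a real query the engine does this pruning itself from the WHERE clause, so you only write the glob and the filter.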


cmollis 996 days ago

OK, thanks. I'll try it out. I can think of a few use cases we have where this might be a good alternative to Athena.
