| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by marklit 256 days ago

That 5 TB of data will probably be 3-400 GB in Parquet. Try and denormalise the data into a few datasets or just one dataset if you can.

DuckDB querying the data should be able to return results in milliseconds if the smaller columns are being used a better if the row-group stats can be used to answer queries.

You can host those Parquet files on a local disk or S3. A local disk might be cheaper if this is exposed to the outside world as well as giving you a price ceiling on hosting.

If you have a Parquet file with billions of records and row-groups measuring into the thousands then hosting on something like Cloudflare where there is a per-request charge could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row-group for any column involved with a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider.

There is a project to convert OSM to Parquet every week and we had to look into some of those issues https://github.com/osmus/layercake/issues/22