|
|
|
|
|
by marklit
256 days ago
|
|
That 5 TB of data will probably be 3-400 GB in Parquet. Try and denormalise the data into a few datasets or just one dataset if you can. DuckDB querying the data should be able to return results in milliseconds if the smaller columns are being used a better if the row-group stats can be used to answer queries. You can host those Parquet files on a local disk or S3. A local disk might be cheaper if this is exposed to the outside world as well as giving you a price ceiling on hosting. If you have a Parquet file with billions of records and row-groups measuring into the thousands then hosting on something like Cloudflare where there is a per-request charge could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row-group for any column involved with a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider. There is a project to convert OSM to Parquet every week and we had to look into some of those issues https://github.com/osmus/layercake/issues/22 |
|