by BFLpL0QNek 2678 days ago

Depending on your size, budget and needs, Snowflake may interest you: https://www.snowflake.com/product/architecture/

I haven't used it myself, but they gave me a presentation on it and it was very, very good.

They store data in S3 and use FoundationDB for metadata and indexing. You can feed it JSON and it'll index it and let you query it at massive scale, shockingly fast.

Obviously they're not aimed at small hobby projects, but if your project has a budget or a serious product depends on it, it's well worth looking at.

On the cheaper / smaller end with plain S3, you can batch up data daily, weekly, etc. The landing bucket acts as a queue: a job processes it and aggregates the small files into daily batch files. You can then roll the daily batches up into weekly batches and so on, which is essentially partitioning. This reduces the total number of files you need to read to answer a query.

If you use deterministic names based on how you plan to query, you can also reduce the number of files you need to list / parse. When batching / re-partitioning the data you can also write it out in Apache Parquet format, which compresses a little better and can be imported directly by some of the querying tools out there.
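
To make the batching idea concrete, here is a minimal sketch of the "landing bucket as a queue" step. It is only an illustration under some assumptions of mine: a local directory stands in for the S3 landing bucket, each small file is one JSON object with a Unix timestamp in a hypothetical "ts" field, and the key layout `daily/dt=YYYY-MM-DD/batch.jsonl` is just one possible deterministic naming scheme. It writes JSON Lines rather than Parquet to stay stdlib-only; the same grouping logic would apply if you swapped in a Parquet writer.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def daily_key(day: str) -> str:
    """Deterministic name for a daily batch (hypothetical layout)."""
    return f"daily/dt={day}/batch.jsonl"


def compact_landing(landing: Path, out_root: Path) -> list[Path]:
    """Merge the small JSON files in `landing` into one JSON Lines file per UTC day."""
    by_day = defaultdict(list)
    for path in sorted(landing.glob("*.json")):
        record = json.loads(path.read_text())
        # Assumption: each record carries a Unix timestamp in a "ts" field.
        stamp = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        by_day[stamp.strftime("%Y-%m-%d")].append((path, record))

    written = []
    for day, items in sorted(by_day.items()):
        target = out_root / daily_key(day)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Append so a later run on the same day extends the existing batch.
        with target.open("a") as fh:
            for _, record in items:
                fh.write(json.dumps(record, sort_keys=True) + "\n")
        # "Dequeue": delete the small files only after the batch is written.
        for path, _ in items:
            path.unlink()
        written.append(target)
    return written
```

Because the batch name is derived from the date, a query for a given day can go straight to `daily/dt=<day>/` instead of listing the whole bucket. A weekly roll-up would be the same loop run over the daily files with a week-based key.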