| HN Mirror



	by dan-robertson 670 days ago
	This sort of general architecture (store Parquet-like files somewhere like S3 and build a metadata database on top) seems reasonably common, and it gives obvious advantages for storing lots of data, scaling horizontally, and scaling storage and compute separately. Where do you feel your advantages are compared to similar systems? E.g. is it certain API choices or affordances like the ‘time travel’ feature, in-house expertise, or some combination of features that don’t usually come together? A slightly more technical question: what are your time-series indexes? Are they about optimising storage, doing fast random-access lookups, or enabling better as-of joins?

2 comments

jjmunro 669 days ago

We do have a specialist time-series index, optimised for things like tick data. It compresses fairly well, but we generally optimise for read time: not scattered random access, but slicing out date ranges. There are two layers of index: a high-level index over the data objects, and an index inside each object in S3.
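
To make the two-layer idea concrete, here is a minimal sketch in Python. It is not the actual implementation: the object keys, the in-memory `objects` dict standing in for S3, the integer timestamps, and the `slice_range` helper are all assumptions made for illustration. The top-level index prunes whole objects by their time bounds, and a per-object sorted timestamp list is then binary-searched to slice out the requested range.

```python
from bisect import bisect_left, bisect_right

# Hypothetical layout: each entry stands in for one data object in S3 and
# holds rows sorted by timestamp. Timestamps are plain integers here.
objects = {
    "obj-0": [(1, "a"), (3, "b"), (5, "c")],
    "obj-1": [(6, "d"), (8, "e"), (9, "f")],
    "obj-2": [(12, "g"), (15, "h")],
}

# Layer 1: high-level index of (min_ts, max_ts, object key), sorted by min_ts.
top_index = sorted((rows[0][0], rows[-1][0], key) for key, rows in objects.items())


def slice_range(start, end):
    """Return all rows with start <= ts <= end, in timestamp order."""
    out = []
    for min_ts, max_ts, key in top_index:
        if max_ts < start or min_ts > end:
            continue  # this object cannot overlap the requested range
        rows = objects[key]
        # Layer 2: per-object index, here just the sorted timestamps.
        ts = [t for t, _ in rows]
        lo = bisect_left(ts, start)
        hi = bisect_right(ts, end)
        out.extend(rows[lo:hi])
    return out


print(slice_range(4, 8))
```

In this toy setup a query for the range 4 to 8 never touches `obj-2`, which is the point of the high-level layer: whole objects outside the range are skipped before any per-object lookup happens.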

A built-in as-of join is something we want to build.
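
For readers unfamiliar with the term, an as-of join matches each row on the left with the most recent row on the right whose timestamp is not later than its own. The following is a small standard-library Python sketch of that idea, not a description of any planned feature; the `asof_join` function and the sample `trades` and `quotes` data are hypothetical.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Attach to each (ts, value) in left the latest right value with ts <= left ts.

    Both inputs must be sorted by timestamp. Rows with no earlier match get None.
    """
    right_ts = [t for t, _ in right]
    joined = []
    for ts, val in left:
        i = bisect_right(right_ts, ts)  # number of right rows at or before ts
        match = right[i - 1][1] if i > 0 else None
        joined.append((ts, val, match))
    return joined


# Hypothetical sample data: (timestamp, price)
trades = [(2, 100.5), (5, 101.0), (9, 100.8)]
quotes = [(1, 100.4), (4, 100.9), (10, 101.2)]

print(asof_join(trades, quotes))
```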


joewood1972 670 days ago

For example, Apache Iceberg is exactly this, complete with bitemporal query support.


dan-robertson 670 days ago

I feel like ‘exactly’ is doing a lot of work in your comment, and I am interested in the reasons why it may not be quite the right word to describe these situations.
