by valyala 56 days ago
Interesting solution! According to the numbers in the "query latency" chapter, a query over cold data that selects samples for 497 time series over a 6-hour time range takes 15 seconds if the queried data isn't available in the cache. This means that typical queries over historical data will take an eternity to execute ;(

Yes, this is a current issue. There are three ways to improve it:

1. The reason it gets slow as you select more series over longer periods of time is that the series have to be pulled for each time bucket in the range, and then the samples have to be pulled for each bucket. By compacting older buckets and merging samples together, historical queries should be pretty comparable to 'more recent' cold queries.
2. We don't pre-cache all the metadata today. If we did that, then we could parallelize sample loads much more efficiently, lowering latency.
3. There is a lot of room to do better batching and tune the parallelism of cold reads.
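
To illustrate the parallelism point, here is a minimal sketch, not the project's actual code. `fetch_object` is a hypothetical stand-in for an object-storage GET that pays a fixed latency per call, and the key names and worker count are made up. It only shows why issuing cold reads concurrently hides per-request latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_object(key: str, latency_s: float = 0.05) -> bytes:
    """Hypothetical object-storage GET: sleeps to simulate request latency."""
    time.sleep(latency_s)
    return f"samples:{key}".encode()


def fetch_serial(keys):
    """One request at a time: total time grows linearly with len(keys)."""
    return [fetch_object(k) for k in keys]


def fetch_parallel(keys, max_workers=16):
    """Issue requests concurrently; Executor.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_object, keys))


if __name__ == "__main__":
    keys = [f"bucket-{i}" for i in range(32)]  # assumed key naming

    t0 = time.perf_counter()
    fetch_serial(keys)
    serial_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    fetch_parallel(keys)
    parallel_s = time.perf_counter() - t0

    print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

With real object storage the right worker count depends on request size and rate limits, so treat `max_workers` as a knob to tune rather than a recommendation.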

We've only been at this for a couple of months. The techniques to improve latency on object storage are well known, we just have to implement them.

Another benefit is this: all the data is on S3, so spinning up more optimized readers to transform older data to do more detailed analysis is also an option with this architecture.

Yes, there is a solution for masking read latency at object storage: run many readers in parallel. I tweeted about it some time ago: https://x.com/valyala/status/1965093140525715714
The other solution is to aggressively size your disk cache and keep effectively the full working set on disk, using object storage just as a durability layer. Then the main benefit is operational simplicity because you have a true shared-nothing architecture between the read replicas (there's no quorum or hash ring to maintain and no deduplication on read). Obviously you'll have a more expensive deployment topology if you do so, but it's still compelling IMO because you have the knobs to tune whether you want to cache on disk or not.
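
A rough sketch of the disk-cache idea, assuming a hypothetical `fetch` callable that reads from object storage. `DiskCache` is an illustrative name and not part of any real project, and eviction is left out on purpose:

```python
import hashlib
import os
import tempfile


class DiskCache:
    """Read-through cache: serve from local disk, fall back to object storage."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical callable: key -> bytes
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Hash the key so any object name maps to a safe file name.
        name = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, name)

    def get(self, key):
        path = self._path(key)
        try:
            with open(path, "rb") as f:
                data = f.read()
            self.hits += 1
            return data
        except FileNotFoundError:
            pass
        # Cache miss: pay the object-storage latency once, then keep a copy.
        self.misses += 1
        data = self.fetch(key)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename, so readers never see partial files
        return data


if __name__ == "__main__":
    cache = DiskCache(tempfile.mkdtemp(), lambda key: key.encode())
    cache.get("block-1")
    cache.get("block-1")
    print(cache.hits, cache.misses)
```

Because each replica only needs its own directory and the shared bucket, nothing here has to coordinate with other readers, which is the shared-nothing property described above.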
+1 to what @agavra said. It's awesome to see you here @valyala. Your writing and talks about timeseries databases were a great inspiration for us. I recall one of your earlier talks about the data layout design of VM. Opendata Timeseries has emulated a lot of it.
also super cool to see you on here valyala! we took a bunch of inspiration from your work at VM. kudos to all you've done :)