by awild
1489 days ago
Question for people who've implemented this kind of compression scheme: some of these mechanisms require local context (RLE and s8b). How do you handle random access in these cases? Is the metadata embedded as a sort of key frame that we have to seek to before decoding the value, or stored as a separate block in the headers? RLE especially sounds like it could degenerate into a binary search per lookup, maybe aided by caching the bounds of the last run and assuming locality and linear-ish usage.
This might sound obvious, but it wasn't mentioned in the article: you can also apply integer-based compression schemes to dictionary-encoded data. And floating-point values don't always need all 64 bits when the use case and input data don't require that level of precision.
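To make the RLE lookup question concrete, here is a minimal Python sketch of one way it could work: keep the cumulative end offset of each run, binary-search those offsets on a lookup, and remember the last run hit so that sequential access skips the search. The class name `RunLengthColumn` and its layout are assumptions for illustration, not how any particular database does it.

```python
from bisect import bisect_right
from itertools import accumulate


class RunLengthColumn:
    """Run-length encoded column with random access by logical index (illustrative)."""

    def __init__(self, runs):
        # runs: list of (value, run_length) pairs, each run_length >= 1
        self.values = [v for v, _ in runs]
        # ends[i] is the exclusive logical end index of run i
        self.ends = list(accumulate(n for _, n in runs))
        self._last = 0  # index of the run hit by the previous lookup

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        r = self._last
        start = self.ends[r - 1] if r > 0 else 0
        if start <= i < self.ends[r]:
            return self.values[r]  # fast path: same run as the last lookup
        if r + 1 < len(self.ends) and self.ends[r] <= i < self.ends[r + 1]:
            r += 1  # fast path: the run right after the last one
        else:
            r = bisect_right(self.ends, i)  # slow path: binary search over run ends
        self._last = r
        return self.values[r]


col = RunLengthColumn([(7, 3), (9, 1), (7, 2)])
print([col[i] for i in range(len(col))])
```

With this layout a cold lookup costs O(log runs), while a linear scan stays on the two fast paths.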
Instead you're typically reading a range of data, and then you can decompress just the blocks required for the data you want to see.
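As a rough Python sketch of that range-read pattern: store points in fixed-size compressed blocks, keep each block's first and last timestamp uncompressed, and decompress only the blocks that overlap the query range. `BLOCK_SIZE`, `build_blocks` and `query` are hypothetical names, and zlib over JSON stands in for a real column codec.

```python
import json
import zlib
from bisect import bisect_left, bisect_right

BLOCK_SIZE = 4  # points per block; hypothetical, real systems use far larger blocks


def build_blocks(points):
    """points: list of (timestamp, value) sorted by timestamp."""
    blocks = []
    for i in range(0, len(points), BLOCK_SIZE):
        chunk = points[i:i + BLOCK_SIZE]
        payload = zlib.compress(json.dumps(chunk).encode())
        # Keep the time bounds outside the compressed payload so they can be searched.
        blocks.append((chunk[0][0], chunk[-1][0], payload))
    return blocks


def query(blocks, t0, t1):
    """Return points with t0 <= timestamp <= t1 and the number of blocks decompressed."""
    first_ts = [b[0] for b in blocks]
    last_ts = [b[1] for b in blocks]
    lo = bisect_left(last_ts, t0)    # first block whose last timestamp >= t0
    hi = bisect_right(first_ts, t1)  # one past the last block whose first timestamp <= t1
    out = []
    for _, _, payload in blocks[lo:hi]:
        for ts, v in json.loads(zlib.decompress(payload)):
            if t0 <= ts <= t1:
                out.append((ts, v))
    return out, max(hi - lo, 0)


blocks = build_blocks([(t, t * 10) for t in range(12)])
print(query(blocks, 5, 6))
```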
Caching of partial queries can also help substantially. For example, if many queries involve querying the max() of some per-second data grouped by minute, it is well worth caching that rather than reading the source data every time to calculate the max().
Typically the query engine can track, for every subquery, how frequently it's used and how many data points it involves, and use that to decide how long to cache the result. As far as I'm aware, no open-source TSDB does this, despite it being a big and simple win, especially for alerting systems and dashboards that run very similar queries frequently.
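A minimal Python sketch of that caching idea, under stated assumptions: the cache counts how often each subquery key is requested, and the time-to-live grows with that count and with the number of points the subquery had to scan. The `SubqueryCache` class and its TTL formula are invented for illustration, not taken from any existing TSDB.

```python
import time


class SubqueryCache:
    """Caches aggregate results; TTL grows with request count and scan cost (illustrative)."""

    def __init__(self, base_ttl=10.0, max_ttl=3600.0, clock=time.monotonic):
        self.base_ttl = base_ttl
        self.max_ttl = max_ttl
        self.clock = clock
        self.uses = {}     # key -> number of times the subquery was requested
        self.entries = {}  # key -> (expires_at, result)

    def get(self, key, compute):
        """compute() must return (result, points_scanned)."""
        now = self.clock()
        self.uses[key] = self.uses.get(key, 0) + 1
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip reading the source data
        result, points = compute()
        # Hypothetical policy: scale the TTL by popularity and by points scanned.
        ttl = min(self.max_ttl, self.base_ttl * self.uses[key] * max(points, 1) / 1000)
        self.entries[key] = (now + ttl, result)
        return result


def max_per_minute():
    samples = [3, 9, 4]  # stand-in for one minute of per-second data
    return max(samples), len(samples)


cache = SubqueryCache()
print(cache.get(("cpu", "max", "minute-0"), max_per_minute))
```

A real engine would also need to invalidate or bypass entries for time buckets that are still receiving writes.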