Hacker News new | ask | show | jobs
by awild 1489 days ago
Question for people who've implemented these kind of compression schemes: some of these mechanisms require local context (rle and s8b) how do you handle random access in these cases? Is the metadata embedded as a sort of key frame to which we have to advance before decoding the value? Or as a separate block in the headers?

RLE especially sounds like it could degenerate into a binary search per lookup, maybe aided by caching the bounds of the last run and assuming locality and linear-ish usage?

This might sound obvious, but wasn't mentioned in the article, you can also apply integer based compression schemes on dictionary encoded data.

And floating point values don't always need those 64bits when the use case and input data don't require that level of precision.

3 comments

Most applications don't require true random access.

Instead you're typically reading a range of data, and then you can decompress just the blocks required for the data you want to see.

Caching of partial queries can also help substantially. For example, if many queries involve querying the max() of some per-second data grouped by minute, it is well worth caching that rather than reading the source data every time to calculate the max().

Typically the query engine can keep counts of every subquery and how frequently it's used and how many data points it involves to decide how long to cache it for. As far as I'm aware no opensource tsdb does this, despite it being a massive simple win, especially for alerting systems and dashboards that run very similar queries frequently.

I was thinking the same thing as I was reading this. I doubt you could retain 100% random access. Instead, I think you would generally create "blocks" that are compressed in a time range that is compatible with your application (ie. 1 day blocks or 1 week blocks, etc.) and then when picking date/times you take X blocks and then further refine the query after compression (example: load May 5th block --> May 5th 8:10AM - 11:13AM). At least, that is what I have done in the past in my own home grown apps. Essentially each block then starts and terminates compression - # of blocks is a trade off in compression efficiency vs. granularity.
Correct, almost all timeseries databases divide the data in shards / partitions / whatever-it’s-called, which are then split by column, which are then compressed as a single unit.

Some databases use a fixed block size (eg as you mention, “1 day”), which are simple and stateless to manage, while others dynamically “split” blocks into smaller blocks (frequently called “ranges”), or merge them back later. The latter is significantly more complex, but is a much better approach for varying workloads where you don’t know the right shard size in advance, or need to deal with the possibility of highly varying workloads, eg you have a lot of traffic on specific time of day / day of week/month/year.

> (...) how do you handle random access in these cases?

Given we're discussing time series, random access is something that's not required at all. There is no use case where you're looking for a single, specific data point of a time series. Time series are used to store and retrieve batches of data points in order to process and/or analize/visualize in batches. Time series queries consists mainly of "get me all the data points of a time series reported between timestamps A and B".

> There is no use case where you're looking for a single, specific data point of a time series.

I'm no expert but it seems trivial to come up with some. The article talked of storing temperature and humidity, so in relation to that:

- "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period. Having to start at the dawn of time is kind of a pain

- "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

- Record highs are not going to be at night, so if you want to find that, we can skip a lot of data there. Now, temperatures don't change every millisecond so it's not a lot of data even if you would have to start decompressing 50 years back, but in general, this kind of optimization is only possible if you have random access. (A better example here might be if you have petabytes of LHC data compressed this way and want to find something in it.)

Random access is not the common case but not a bad feature to have, either. The comment kind of acknowledges this by mentioning batches, but starts by dismissing the problem

The commenter you're responding to isn't saying that it's literally impossible to retrieve a single point without doing a full scan of the data, but rather that it's not the top priority of these data structures to support queries for individual points.

The GP comment even says:

> Time series queries consists mainly of "get me all the data points of a time series reported between timestamps A and B"

which seems exactly like what you mentioned with "$now-$period".

These kinds of data structures, often found in OLAP databases, assume that point queries will be less common, and they accept a bit higher latencies for those queries. But those latencies are still fairly small.

> "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period.

Your example relies on a query to retrieve all data points between timestamp A (X days from now) and B (now).

> - "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

What's your definition of then?

In time series, it's typically a statistic calculated from all data points within a time window. It might be max, it might be a percentile, it might be a summary statistics of some kind.