Hacker News new | ask | show | jobs
by lucb1e 1495 days ago
> There is no use case where you're looking for a single, specific data point of a time series.

I'm no expert but it seems trivial to come up with some. The article talked of storing temperature and humidity, so in relation to that:

- "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period. Having to start at the dawn of time is kind of a pain

- "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

- Record highs are not going to be at night, so if you want to find that, we can skip a lot of data there. Now, temperatures don't change every millisecond so it's not a lot of data even if you would have to start decompressing 50 years back, but in general, this kind of optimization is only possible if you have random access. (A better example here might be if you have petabytes of LHC data compressed this way and want to find something in it.)

Random access is not the common case but not a bad feature to have, either. The comment kind of acknowledges this by mentioning batches, but starts by dismissing the problem

2 comments

The commenter you're responding to isn't saying that it's literally impossible to retrieve a single point without doing a full scan of the data, but rather that it's not the top priority of these data structures to support queries for individual points.

The GP comment even says:

> Time series queries consists mainly of "get me all the data points of a time series reported between timestamps A and B"

which seems exactly like what you mentioned with "$now-$period".

These kinds of data structures, often found in OLAP databases, assume that point queries will be less common, and they accept a bit higher latencies for those queries. But those latencies are still fairly small.

> "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period.

Your example relies on a query to retrieve all data points between timestamp A (X days from now) and B (now).

> - "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

What's your definition of then?

In time series, it's typically a statistic calculated from all data points within a time window. It might be max, it might be a percentile, it might be a summary statistics of some kind.