Hacker News new | ask | show | jobs
by zeptomu 3348 days ago
I have to admit that I did not know time-series databases are a thing and just recently realized that because they came up more often. Unfortunately I do not have an answer for you, but I know traditional DB systems are hard to build, as we expect more reliability and guarantees from them compared to other (daemon) software.

However I am interested to know why this kind of data and/or problems require specific software and why it can't be handled by traditional RDBMs? Obviously you could model the domain with a classic database, but seemingly there exist important queries that can't be satisfied (at least timely) through classic systems - what kind of queries are these, and why do they fail using a general-purpose DB approach?

3 comments

I'm no expert, but I believe the Crux of the issue is how data is naturally organized and stored. In a row oriented database, most data is stored in pages that contain rows. There are often indexes with ordering, but unless a secondary index contains all the values needed (often called a covering index), the entire row must be retrieved to answer any query that uses that information.

Most Time-series databases are columnar in nature, and often have the concept of time baked into the ordering of values (think vectors not sets). Because they are columnar, they are more trivial to retrieve just the data needed by a query. Suddenly instead of loading a billion rows and averaging the value in one column, you're just accesing the column itself to answer the question. From an IO perspective that's a huge savings.

Now imagine you have special, dark arts for working on compressed data, and a query optimizer you've been tuning for a decade for demanding clients. It does not surprise me that kdb is much faster than the open source competitors. And to be fair, even with excellent traditional databases like postgres, I bet Oracle, db2, and Ms SQL are still generally faster in most queries.

> Suddenly instead of loading a billion rows and averaging the value in one column, you're just accesing the column itself to answer the question. From an IO perspective that's a huge savings.

The other side of this is that writing out data in that form would naively be an iop per sample, as you're usually appending one sample to 100s of time series in one request.

A significant part of monitoring TSDB design is buffering up samples and batching writes in order reduce that iop rate to something sane.

For example in the right circumstances the 1.x Prometheus design can ingest 250k samples/s on a hard disk which provides ~100 iops/s.

There is also timescale which is built atop Postgres. Or rather, modifies Postgres' query engine etc., this is not some rails app!

https://blog.timescale.com/when-boring-is-awesome-building-a...

> ... why this kind of data and/or problems require specific software and why it can't be handled by traditional RDBMs?

There's Tgres, that stores time series in Postgres, efficiently (using Postgres arrays and views):

https://github.com/tgres/tgres

https://grisha.org/blog/2017/01/21/storing-time-seris-in-pos...