by nickpeterson 3348 days ago
I'm no expert, but I believe the crux of the issue is how data is naturally organized and stored. In a row-oriented database, most data is stored in pages that contain rows. There are often indexes with ordering, but unless a secondary index contains all the values needed (often called a covering index), the entire row must be retrieved to answer any query that uses that information.

Most time-series databases are columnar in nature, and often have the concept of time baked into the ordering of values (think vectors, not sets). Because they are columnar, it is much easier for them to retrieve just the data a query needs. Suddenly, instead of loading a billion rows and averaging the value in one column, you're just accessing the column itself to answer the question. From an IO perspective that's a huge savings.
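To make the IO argument concrete, here is a toy sketch in Python. It is not how any real database lays out pages. The table shape (timestamp, sensor_id, value) and the function names are made up for illustration, and "fields touched" is only a crude stand-in for bytes read from disk.

```python
from array import array

# Hypothetical table with three columns: (timestamp, sensor_id, value).
N = 1000
rows = [(t, t % 10, float(t)) for t in range(N)]  # row-oriented: one tuple per row

# Column-oriented: one contiguous typed array per column.
columns = {
    "timestamp": array("q", (r[0] for r in rows)),
    "sensor_id": array("q", (r[1] for r in rows)),
    "value": array("d", (r[2] for r in rows)),
}


def avg_row_store(rows):
    """Average the 'value' field; every full row has to be visited."""
    fields_touched = 0
    total = 0.0
    for row in rows:
        fields_touched += len(row)  # the whole row is loaded, not just one field
        total += row[2]
    return total / len(rows), fields_touched


def avg_column_store(columns):
    """Average the 'value' column; only that one array is read."""
    col = columns["value"]
    return sum(col) / len(col), len(col)
```

Both functions return the same average, but the row version touches three fields per row while the column version touches one. With wider rows the gap grows accordingly.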

Now imagine you have special dark arts for working on compressed data, and a query optimizer you've been tuning for a decade for demanding clients. It does not surprise me that kdb is much faster than the open-source competitors. And to be fair, even with excellent traditional databases like Postgres, I bet Oracle, DB2, and MS SQL are still generally faster on most queries.

1 comment

> Suddenly, instead of loading a billion rows and averaging the value in one column, you're just accessing the column itself to answer the question. From an IO perspective that's a huge savings.

The other side of this is that writing out data in that form would naively cost one IOP per sample, as you're usually appending one sample to each of 100s of time series in one request.

A significant part of monitoring TSDB design is buffering up samples and batching writes in order to reduce that IOP rate to something sane.

For example, in the right circumstances the 1.x Prometheus design can ingest 250k samples/s on a hard disk that provides ~100 IOPS.
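A rough sketch of the buffering idea in Python follows. This is not Prometheus's actual storage code. The class name, the per-series batch size, and the counter standing in for real disk writes are all assumptions made for illustration.

```python
class BufferedSeriesWriter:
    """Toy per-series write buffer; write_ops counts simulated disk writes."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffers = {}   # series name -> samples not yet written
        self.flushed = {}   # series name -> samples "on disk"
        self.write_ops = 0  # one increment per simulated I/O operation

    def append(self, series, timestamp, value):
        buf = self.buffers.setdefault(series, [])
        buf.append((timestamp, value))
        if len(buf) >= self.batch_size:
            self._flush(series)

    def _flush(self, series):
        buf = self.buffers.get(series)
        if not buf:
            return  # nothing pending, so no I/O is issued
        self.flushed.setdefault(series, []).extend(buf)
        self.write_ops += 1  # the whole batch goes out as a single write
        self.buffers[series] = []

    def flush_all(self):
        for series in list(self.buffers):
            self._flush(series)


def simulate(num_series, samples_per_series, batch_size):
    """Each scrape appends one sample to every series; return total write ops."""
    writer = BufferedSeriesWriter(batch_size)
    for t in range(samples_per_series):
        for s in range(num_series):
            writer.append(f"series_{s}", t, float(t))
    writer.flush_all()
    return writer.write_ops
```

With 100 series and 1000 samples each, a batch size of 1 (the naive case) issues one write per sample, while a batch size of 250 issues four writes per series. Taking the figures quoted above at face value, 250k samples/s on ~100 IOPS works out to roughly 2,500 samples per I/O operation.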