Hacker News new | ask | show | jobs
by ghc 2782 days ago
While I understand your point, you are quite mistaken if you think that time is just another key. Dealing with time properly requires a concept of point distance, similar to GIS systems requiring 2d distance understanding. You cannot do joins on time with SQL databases unless you want to throw away important data.

As an example, in the industry I work in, you may have no readings for days or weeks, and then hundreds of readings from the same sensor. Why? Many systems in industrial environments send new readings only "on-change", and assume the underlying data storage architecture will forward fill to in-between times. This is why the practically ancient time series architecture of data historians still dominates in these use cases.

In fact, for many time series functions you actually have to throw away the notion of relational joins to be able to efficiently perform time-preserving joins. Window functions only work in basic use cases with relatively small amounts of data where you're aggregating.

2 comments

Vertica has this functionality and it has been there for years. Fully functional database and you can do time series joins, gap filling with linear interpolation or constant. You can define the intervals at what you want the data points. And you can scale from a few gigs to petabytes of data. https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/...
I agree! Vertica's temporal capabilities are marvelous and the engineers who worked on the planner optimizations for the time extensions are brilliant.

There are of course trade-offs to the approach Vertica takes -- look at StreamBase for a very different take on the problem, another Stonebraker project.

Any of course historians represent yet another take, optimized for point-in time queries that are native and don't need the processing extensions Vertica uses.

I'm not sure where the complexity is that you claim, nor what it has to do with data historians?

So what if there are missing rows? This doesn't affect the database and any aggregations will work fine. Databases don't "fill-in" data, but you can definitely write whatever SQL you need to create averages and buckets to smooth out results.

From reading your website, it seems you're talking about the "last value recorded" as of a certain time, which doesn't seem to be a common query but is totally possible. KDB+ has "asof" joins and others can handle it with window functions using last_value().

We run queries on a table containing 2.7+ trillion rows of data that has no set pattern and infinite cardinality, and results return within seconds. Window functions and joins work without issue. Have you actually tried using a columnstore?

Relational databases represent a column/row-oriented architecture. Data historians are a specialized, non-relational, time-oriented architecture. Using time as a key in a relational index implies that only ordering is important, but that is not the case. Distance between points in time is extremely important because time operates on a continuous 1d line and data points are represented at varying distances between each other on that line. Data historians are architected to both preserve this temporal relationship and take advantage of this by eliminating duplicate data, employing temporal compression techniques to be able to store millions of readings per second for years worth or data.

> From reading your website

My website doesn't have much to do with this because Sentenai isn't a time series database system. I did, however, spend most of my time in research working on temporal data systems, and have been fortunate to collaborate with or learn from researchers who have spent decades solving the unique problems that temporal data presents. What you might consider uncommon for your use cases is extremely common in manufacturing, defense and other areas.

There's a decades-old industry around database systems that handle time natively. And while many support SQL as a lingua franca, and some are column stores, they're not relational by any means as they either extend SQL to support time, or limit non-temporal joins to ensure performance. StreamBase, Kdb, Aurora and many other specialized architectures exist because one size does not fit all. Michael Stonebraker, whose work has included StreamBase, Vertica, Tamr, Postgres, Aurora, and many others, famously published this paper about the very problem: https://cs.brown.edu/~ugur/fits_all.pdf .

Further reading that might be illuminating:

  1. http://cs.brown.edu/research/aurora/vldb03_journal.pdf
  2. http://www.cs.rochester.edu/u/james/Papers/AllenFerguson-events-actions.pdf
  3. https://books.google.com/books?id=BK6oCAAAQBAJ&pg=PA9&source=gbs_toc_r&cad=4#v=onepage&q&f=false (excerpt)
I appreciate your links to further reading, and I'm trying to read the Aurora paper right now but after reading the abstract and the intro (I'm in progress right now), I can't find a case that is uniquely fit/perfect for data historians... I know this is already asking a lot, but would you mind giving me one go-to- use case that really made you think "this is what purpose-built data historian-style databases are good for?".

Every issue mentioned in the abstract/intro (which are meant to motivate the paper) seems like it can be solved as an add-on to existing application databases (albeit with their most recent developments/capabilities in mind). The very description of HADP vs DAHP systems seems silly, because it's just a question of write load, and that's fundamentally only solved with batching and efficient IO, or if you give up durability, it doesn't seem inherent to the data model. There's also assertions like:

> Moreover, performance is typically poor because middleware must poll for data values that triggers and alerters depend on

But like, postgres though, you're free to define a better/more efficient LISTEN/SUBSCRIBE based trigger mechanism, for example, you can highly optimized code right in the DB... Thinking of some of the cases called out in the paper here's what I think in my head:

- Change tracking vs only-current-value -> just record changes/events, as far as tables getting super big, partitioning helps this (timescaledb does this)

- Backfilling @ request time -> an postgres extension could do this

- Alerting -> postgres does have customizable functions/procedures as well as LISTEN/SUBSCRIBE. The paper is right (?) about TRIGGERs not scaling then this might be the most reasonable point.

- Approximate query answering is possible with postgres with stuff like HyperLogLog, but the paper is certainly right in that it is not implemented by default.

Maybe I'm mistaking the extensibility of postgres for the redundancy of the paradigm, akin to thinking something like "lisp is multi-paradigm so why would I use Haskell for it's enhanced inference/safety".

I'm still reading the paper so maybe by the end of it it will dawn on me.

So Aurora isn't a historian, but is a complex event processing system. It's an entirely different beast that solves very specific problems around high-speed queries that could theoretically require scanning through all data stored historically for queries.

I'm not a huge fan of historians (I've spent too much of my career working with them), but I can definitely tell you where they make sense. The scenario is this:

Imagine you have a large facility with thousands of machines, each with a programmable logic chip for controls and monitoring. These machines create lots of data and so often employ data reduction semantics by reducing data to on-change rather than sampling sensors at thousands of hertz. A single machine may have dozens or hundreds of variables to track. These tags might be hierarchical: Machine 1, subsystem 5, variable b. If you say there's 100,000 total tags to track in the facility, and they're on average sampled at 10hz, you need a system capable of writing a million durable timestamped values per second. Now that's child's play for, say, google, but if you're a manufacturer, you can't afford to spend massive amounts of money on cloud systems, and usually want to do this all on a single server on the factory floor because you need realtime monitoring that can display the current value in time for every single tag. ( https://www.ws-corp.com/LiveEditor/images/SLIDES/10/3.jpg ). Ideally, in a single node scenario, you want compression. It's not uncommon to store 100 billion timestamped values per day and keep them for a year or more for audit purposes is something goes wrong. Today, for the sake of predictive maintenance, data retention policies of up to 10 years are becoming more common.

So what would you sacrifice to be able to do efficient realtime monitoring and ingestion of millions of data points per second? You can't use queueing semantics to protect an RDBMS because logging can't take more than a 10th of a second per point. If you think about the use case, what you'd sacrifice is transactional queries and row-level joins, because you just don't need them. At the same time, this data is really sparse when you look at it from a table's perspective, so you'll want something like a column store to underly the data storage.

So what we do is throw out transactional guarantees, choose a storage system that is good at compression (roll-ups in some historians will store a formula approximating the data instead of raw data itself over a window), and prioritize speed of point retrieval for most recent "hot data" by caching it in-memory.

You can of course extend Postgres to achieve many of these things, but having done it myself, in practice it's sub-optimal in the exact same way that using bubble sort for all your programmatic sorting needs is sub-optimal.

One thing you might want to keep in mind is that many of the people involved in Aurora are the authors of Postgres. They're not arguing you can't do things in Postgres, they're arguing that in practice the RDBMS's guarantees are theoretically incompatible with high performance in the area of Complex Event Processing, because alignment between different simple events (recorded as rows in a database) can drift so far that memory requirements become prohibitive if you don't use a stream-processing architecture.

Also keep in mind that Aurora is from 2002 and many of the ideas have been implemented elsewhere over time. The great thing about Postgres is that it's perfect scaffolding on which you can build other stuff.

Thanks for taking the time to give such a detailed answer and explanation on historian systems.
Most data/operational historians are separate programs on top of databases, so is that what you're actually talking about? Your papers seem to suggest that.

Stonebraker is talking about OLTP vs OLAP. I agree that they are very different scenarios.

Over the past 10 years I've had the misfortune of working with most major historians in the field (Oracle's the biggest one I never encountered) and not a single one implemented process data storage on top of a relational database. Some used relational databases to store asset metadata, so maybe that's what you're thinking of?

If you read all the way to the end Stonebraker is actually advocating for specialized architectures like, array databases (SciDB) and stream processing engines like StreamBase, which at the time was just gaining GUI-based query creation capabilities because it was difficult to teach its non-relational concepts to SQL users.

You don't have "missing rows". For time series A you have a time point at 12:01 AM, a datapoint at 12:02 AM, and another datapoint two weeks later at 5:04 PM. For time series B the times are different. You need some notion of whatever state the physical system was in at any given time.
Yes, I understand this as the "last value recorded" concept in my comment. KDB+ supports this with "asof" joins. Others can just do it by scanning a wider time frame or the entire table.

Null gaps in a columnstore can be skipped over basically instantaneously and usually are just zone map/index lookups. Again I question how common this query is and whether it's really worth limiting yourself to a special TSDB because of it.

> Others can just do it by scanning a wider time frame or the entire table.

"Scanning the entire table" for every request to have the last value recorded is rarely a practical option.

> KDB+ supports this with "asof" joins.

> [...]

> Again I question how common this query is and whether it's really worth limiting yourself to a special TSDB because of it.

KDB literally markets itself as a time series database. What's the point you're making again?

I think TimescaleDB lacks an "asof" function for now, but it makes up for it by having the full power of PostgreSQL for other stuff. Regardless, Time Series databases like KDB and TimescaleDB are useful.

EDIT:

it looks like TimescaleDB recommends using

  ORDER BY time DESC LIMIT 1
to get the most recent value for any particular set of sources that you're SELECTing over, which would use indices and be reasonably fast.
> KDB literally markets itself as a time series database

kdb+ also markets itself as OLAP/OLTP as well.

A lot of these guys market themselves as a "time series database" because kdb+ do and they want to be compared with kdb+ by people who haven't used kdb+ (but might be considering it).

Distributed relational column-oriented databases are best at large data volumes and OLAP queries. KDB+ is one of those, even though they call it a TSDB in marketing terminology because of its FinTech customer base.

TimescaleDB is not a TSDB, it's an extension to add automatic partitioning to PostgreSQL tables. Timescale helps Postgres get more performance, but it does not give you the full capabilities of a real distributed column-oriented system. If you must use PostgreSQL though then it's a good compromise.

The query you posted does not match the discussion about the last value at a specific instant in time, only the last value ever recorded in the table for that key.

> Distributed relational column-oriented databases are the best at large OLAP data volumes and queries. KDB+ is one of those, even though they call it a TSDB in marketing terminology because of its FinTech customer base.

You're mistaken about Kdb's relational features. Kdb was designed as a time series processing engine using arrays (columns). Column storage doesn't have anything to do with whether a database is relational or not, and Kdb wasn't originally any more relational than the language Erlang is.

In our case, "last value" isn't good enough. We do interpolation, and use compression algorithms for which interpolation minimizes reconstruction error.