Hacker News new | ask | show | jobs
by ryzhyk 602 days ago
The correct way to think about the problem is in terms of evaluating joins (or any other queries) over changing datasets. And for that you need an engine designed for *incremental* processing from the ground up: algorithms, data structures, the storage layer, and of course the underlying theory. If you don't have such an engine, you're doomed to build layer of hacks, and still fail to do it well.

We've been building such an engine at Feldera (https://www.feldera.com/), and it can compute joins, aggregates, window queries, and much more fully incrementally. All you have to do is write your queries in SQL, attach your data sources (stream or batch), and watch results get incrementally updated in real-time.

4 comments

Is it related to Differential Dataflow / timely dataflow https://github.com/TimelyDataflow/differential-dataflow
We have our own formal model called DBSP: https://docs.feldera.com/papers

It is indeed inspired by timely/differential, but is not exactly comparable to it. One nice property of DBSP is that the theory is very modular and allows adding new incremental operators with strong correctness guarantees, kind of LEGO brick for incremental computation. For example we have a fully incremental implementation of rolling aggregates (https://www.feldera.com/blog/rolling-aggregates), which I don't think any other system can do today.

Fast rolling aggregates are swell. I meet a lot of people who are building trading systems and want this sort of thing, but it usually isn't a great choice because the perfectly rectangular kernel probably isn't the best possible version of the feature and because arbitrary kernels can be well approximated using a state of constant size rather than a large buffer storing a sliding window.
Are you aware of any efforts to apply DBSP's theory to a general programming language/environment? From my perspective, DDlog was the most inspiring project in the field of incremental computation, but it seems like all of these projects just lead to implementations of streaming databases or other similar commercial products that fit into Data™ pipelines (no offense). Incremental computation pops up everywhere, from databases to business logic to UI rendering and video game graphics, and I have this hunch that if the problem could be solved at a fundamental level and in an accessible way, we could have revolutionary gains for programmers and programs.
Thanks for the kind words about DDlog :)

The reason DBSP and Differential Dataflow work so well is because they are specialized to relational computations. Relational operators have nice properties that allow evaluating them incrementally. Incremental evaluation for a general purpose language like Rust is a much, much harder problem.

FWIW, DBSP is available as a Rust crate (https://crates.io/crates/dbsp), so you can use it as an embedded incremental compute engine inside your program.

Indeed. I've experimented a bit with abusing DD/DBSP for my purposes by modeling various kinds of data structures in terms of Z-sets, but these efforts have not yielded very impressive results. :)

For how elegant DBSP is I still found the paper a tough nut to crack, and it really is one of the more accessible theoretical contributions in the space, at least from this grubby programmer's perspective... I hope to devote some time to study and play around more, but in the meantime I'm rooting for you!

Thanks again!

You may want to check out this tutorial for a hands-on introduction to DBSP: https://docs.rs/dbsp/0.28.0/dbsp/tutorial/index.html

Also there's now a DBSP implementation in pure Python! https://github.com/brurucy/pydbsp
(Not who you are replying to) Not sure if it’s specifically related to DBSP but checkout incremental DataFun (slide ~55 of https://www.rntz.net/files/stl2017-datafun-slides.pdf) and the paper cited there: A Theory of Changes for Higher Order Languages: Incrementalizing Lambda-calculi by Static Differentiation (Cai et. al, PLDI 2014).
Thank you, I'll add these to my reading list!
Does this offer any non-SQL programmatic interfaces or ways to do Complex Event Processing (e.g. https://www.databricks.com/glossary/complex-event-processing )? A lot of those scenarios would be tough to express in SQL.
Yes, you can write Rust UDFs with Feldera and even use the dbsp crate directly if you'd like.
Sorry, but I don't see much on https://docs.feldera.com/sql/udf/ that suggests how CEP would work. The example is single-valued and doesn't indicate if state or windows for CEP would be supported.
Hi, I’ve read the DBSP paper and it’s a really well-thought out framework; all the magic seemed so simple with the way the paper laid things out. However, the paper dealt with abelian Z-sets only, and mentioned that in your implementation, you also handle the non-abelian aspect of ordering. I was wondering if you guys have published about how did you that?
Apologies about the confusion. We indeed only solve incremental computation for Abelian groups, and the paper is making a case that database tables can be modeled as Abelian groups using Z-sets, and all relational operators (plus aggregation, recursion, and more) can be modeled as operations on Z-sets.
Yes, I might have misworded my question. My question is in relation to this paragraph on page 12:

"Note that the SQL ORDER BY directive can be modelled as a non-linear aggregate function that emits a list. However, such an implementation is not efficiently incrementalizable in DBSP. We leave the efficient handling of ORDER BY to future work."

My understanding is that Feldera does indeed support ORDER BY, which I imagine it does incrementally, thus my question.

The statement in the paper that ordering is not efficiently incrementalisable seems to makes sense to me. It is clear that even though Z-sets are not naively able to represent diffs of ordered relations (since Z-sets are unordered), ordering can be modelled as an aggregate that first builds up the first row, then the second row, and so on. Even as formulated this way however, I fail to see how the entire "incrementalised" computation would still be practically incremental, in the sense that the size of the output diff (Z-set) is small as long as the diff of the input is small.

For example, consider the query `select x from y order by x asc`, and the following values respectively occur in the stream of x: 5, 4, 3, 2, 1. Now, consider the incremental diff for the last value of 1. If presumably one models order by a list aggregation, then the Z-set for the entire computation seems to be

    - [ 2, 3, 4, 5 ]
    + [ 1, 2, 3, 4, 5 ]
which grows with the size of the output set rather than the size of the input diff. If presumably one models order by e.g. adding an order column, the diff would be

    - 2 (index: 0)
    - 3 (index: 1)
    - 4 (index: 2)
    - 5 (index: 3)
    + 1 (index: 0)
    + 2 (index: 1)
    + 3 (index: 2)
    + 4 (index: 3)
    + 5 (index: 4)
which once again varies with the size of the output set.
Your explanation of why ORDER BY is not efficiently incrementalizable is spot on. At the moment Feldera ignores the outermost ORDER BY clause, unless it is part of the ORDER BY ... LIMIT pattern, which is SQL's way to express the top-k query.

So, a better solution is still future work :)

I was with you and thinking Postgres over and over until the second paragraph. Which isn’t to say anything bad about your product, it sounds very cool.

But i’d work in “just like Postgres”.

Good point. The goal is indeed to be a Postgres of incremental computing: any SQL query should "just work" out of the box with good performance and standard SQL semantics. You shouldn't need a team of experts to use the tool effectively.