Hacker News new | ask | show | jobs
by brsg 1975 days ago
You're asking this in good faith, so you don't deserve to get downvoted. I think people on hn are a bit sensitive to comments they perceive as "reductionist" that may oversimplify complex problems, even if they're honest questions.

But you're right, at it's core this kind of problem will use techniques like memoization, and most projects that need something like this will have adhoc approaches to solve this problem. The advantage of differential data flow is that it's a generalized approach to this problem. The business logic behind these workflows, between tracking dependencies and updates can get pretty damn complicated and difficult to maintain. Having a generalized approach would make building these dataflows much simpler.

The paper it's based on is pretty skimmable, so I recommend taking a look at it. https://raw.githubusercontent.com/TimelyDataflow/differentia...

1 comments

Thanks for linking the paper.

I think I maybe a bit environmentally damaged from mainly using Clojure. Algorithms similar to this are fairly common in the Clojure ecosystem. Memoization is part of the standard library too.

The claim (in the article) that no one cares about Differential Dataflow seems to be only true when talking about this specific library. The general concept surely translates to some combination of simple concepts like memoization, topological sorting, partial application, etc. so it's obvious to me that many adhoc implementations would exist tailored to more specific needs in a different programming languages with different feature sets. Sometimes buying into a framework is a lot more work than rolling your own, especially if it means having to switch to a different programming language.

Importantly, this doesn't just use memoization (it actually avoids having to spend memory on that), but rather uses operators (nodes in the dataflow graph) that directly work with `(time, data, delta)` tuples. The `time` is a general lattice, so fairly flexible (e.g. for expressing loop nesting/recursive computations, but also for handling multiple input sources with their own timestamps), and the `delta` type is between a (potentially commutative) semigroup (don't be confused, they use addition as the group operation) and an abelian group. E.g. collections that are iteratively refined in loops often need an abelian `delta` type, while monoids (semigroup + explicit zero element) allow for efficient append-only computations [0].

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2019...