Hacker News new | ask | show | jobs
by nikhilsimha 630 days ago
dumb question: how do z-sets or feldera deal with updates to values that were incorporated into the max already?

For example - max over {4, 5} is 5. Now I update the 5 to a 3, so the set becomes {4, 3} with a max of 4. This seems to imply that the z-sets would need to store ALL the values - again, in their internal state.

Also there needs to be some logic somewhere that says that the data structure for updating values in a max aggregation needs to be a heap. Is that all happening somewhere?

2 comments

We use monotonicity detection for various things. I believe (can double check) that it's used for max as well. But you're correct that in the general case, max is non-linear, so will need to maintain state.

Update from Leonid on current implementation: each group is ordered by the column on which we compute max, so it's O(1) to pick the last value from the index.

So the writes are O(N) then - to keeps reads at O(1)?
Both reads and writes are O(1) in time complexity. Writes additionally have the log(N) amortized cost of maintaining the LSM tree.
gotcha! thanks for the clarification
Just a guess... wouldl like to hear the answer as well.

they probably have a monotonicity detector somewhere, which can decide whether to keep all the values or discard them. If they keep them, they probably use something like a segment tree to index.

That's right, we perform static dataflow analysis to determine what data can get discarded. GC itself is done lazily as part of LSM tree maintenance. For MAX specifically, we don't have this optimization yet. In the general case, incrementally maintaining the MAX aggregate in the presence of insertions and deletions requires tracking the entire contents of the group, which is what we do. If the collection can be proved to be append-only, then it's sufficient to store only the current max element. This optimization is yet coming to Feldera.
Yes, we do a lot of work with monotonicity detection. It's central to we perform automatic garbage collection based on lateness.