| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by galeaspablo 426 days ago

Apologies, we've been so deep into this problem that we take our slang for granted :)

A graphical representation might be worth a thousand words, keeping in mind it's just one example. Imagine you're traversing the following.

A1 -> A2 -> A3...

B1 -> B2 -> B3...

C1 -> C2 -> C3...

D1 -> D2 -> D3...

E1 -> E2 -> E3...

F1 -> F2 -> F3...

...

Efficient concurrent consumption of these messages (while respecting causal dependency) would take O(w + h), where w = the _width_ (left to right) of the longest sequence, and h = the _height_ (top to bottom of the first column)

But Pulsar, Kafka + parallel consumer, Et al. would take O(n^2) either in processing time or in space complexity. This is because at a fundamental level, the underlying data storages store looks like this

A1 -> A2 -> A3...

B1 -> B2 -> B3...

C1 -> C2 -> C3...

D1 -> D2 -> D3...

E1 -> E2 -> E3...

F1 -> F2 -> F3...

Notice that the underlying data storage loses information about nodes with multiple children (e.g., A1 previously parented both A2 and B1)

If we want to respect order, the consumer will be responsible for declining to process messages that don't respect causal order. E.g., attempting to process F1 before E1. Thus we could get into a situation where we try to process F1, then E1, then D1, then C1, then B1, then A1. Now that A1 is processed, kafka tries again, but it tries F1, then E1, then D1, then C1, then B1... And so on and so forth. This is O(n^2) behavior.

Without changing the underlying data storage architecture, you will either:

1. Incur O(n^2) space or time complexity

2. Reimplement the queuing mechanism at the consumer level, but then you might as well not even use Kafka (or others) at all. In practice this is not practical (my evidence being that no one has pulled it off).

3. Face other nasty issues (e.g., in Kafka parallel consumer you can run out of memory or your processing time can become O(n^2)).

2 comments

taurath 425 days ago

Wanted to say thanks so much for writing this all out - I've always thought of ordering as being sort of inherently against the point of parallel streams, so its interesting to hear about the state of the art and the benefits that are trying to be gleaned! I'm not thinking in stream processors terribly often so I wasn't aware of how dependencies are mapped.

If you don't mind another followup (and your patience with my ignorance hasn't run out :P), wouldn't the efficient concurrent consumption imply knowing the dependency graph before the events are processed? IE, is it possible in any instance to get to O(w+h) in a stream?

link

galeaspablo 424 days ago

No problem. :)

Yes, order needs to be known.

So no, it’s not possible to do O(w+h) with streams partitioned by key. Unless, of course you use a supplementary index, but then you might as well not use the streams storage at all and store the records in the same storage as the index.

It’s worth noting that Pulsar does something like this (supplementary way to keep track of acknowledged messages), but their implementation has O(n^2) edge cases.

link

singron 426 days ago

Do you have an example use case for this? This does seem like something unsuited to kafka, but I'm having a hard time imagining why you would structure something like this.

link

galeaspablo 426 days ago

Great follow up question, thank you. I could talk about this "topic" for days, so I appreciate the opportunity to expand. :)

Let's imagine ourselves as a couple of engineers at Acme Foreign Exchange House. We'd like to track Acme's net cash position across multiple currencies, and execute trades accordingly (e.g., heding). And we'd like to retrospectively analyze our hedges, to assess their effectiveness.

Let's say I have this set of transactions (for accounts A, B, C, D, E, F, etc.)

A1 -> A2 -> A3 -> A4

B1 -> B2 -> B3 -> B4

C1-> C2

D1 -> D2 -> D3 -> D4

E1 -> E2

Let's say that that:

- E1 was a deposit made into account E for $2M USD.

- E2 was an outgoing transfer of $2M USD sent to account F (incoming £1.7M GBP at F1).

If we consume our transactions and partiton our consumption by account id, we could get into a state where E1 and F1 are reflected in our net position, but E2 isn't. That is, our calculation has both $2M USD and £1.7M GBP, when in reality we only ever held either $2M USD or £1.7M GBP.

So what could we do?

1. Make sure that we respect causality order. I.e., there's no F1 reflected in our net position if we haven't processed E2.

2. Make sure that pairs of transactions (e.g., E2 and F1) update our net position atomically.

This is otherwise known as a "consistent cut" (see slide 25 here https://www.cs.cornell.edu/courses/cs6410/2011fa/lectures/19...).

Opinion: the world is causally ordered in arbitrary ways as above. But the tools, frameworks, and infrastructure more readily available to us struggle at modeling arbitrary partially ordered causality graphs. So we shrug our shoulders, and we learn to live with the edge cases. But it doesn't have to be so.

link