by abadid 1915 days ago
I'm the author of this piece. I'm happy to respond to comments in this thread.

At what point does it not make sense anymore to perform all steps of an operation on a single partition of data together? Is there a point of diminishing returns?

I've seen some stream processing systems keep each partition's data together and apply all transformations to it at once (e.g. Kafka Streams), while others parallelize the transformations themselves (e.g. Apache Storm, IIRC).

Also, isn't there a tradeoff in that error recovery becomes more costly in a depth-first (for lack of a better term) processing paradigm?

In general, whenever you need to perform a join (of multiple datasets), that ends the pipeline of local operations on a partition. Other operators that necessarily require data from other partitions end local pipelines as well. This is why linear scalability is not completely achieved in practice: most interactions with data cannot be performed in a completely partitionable way.

Usually, the operations that force the local pipeline to end show up in a query plan well before you hit any tradeoff from doing too much in a local pipeline, since local pipelines are SO much faster than anything that requires communication.
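
To make the point about joins ending a local pipeline concrete, here is a minimal Python sketch. It is a toy illustration, not how any particular system implements this: the function names (local_pipeline, shuffle_by_key, local_hash_join, run_query) and the sample data are all hypothetical. The filter and map steps run in one pass over a single partition, while the join first has to move rows between partitions.

```python
from collections import defaultdict

def local_pipeline(partition):
    """Filter and map fused into one pass; reads only this partition."""
    return [(key, amount * 2) for key, amount in partition if amount > 0]

def shuffle_by_key(partitions, num_partitions):
    """Repartition rows by key. Each output partition may receive rows
    from every input partition, so this is the communication step."""
    out = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            out[hash(key) % num_partitions].append((key, value))
    return out

def local_hash_join(left, right):
    """Hash join within one partition, valid only after both sides
    have been repartitioned on the join key."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in index.get(key, ())]

def run_query(orders, users, num_partitions=2):
    # Stage 1: purely local, one independent pipeline per partition.
    filtered = [local_pipeline(p) for p in orders]
    # Pipeline breaker: the join needs matching keys to be co-located.
    left = shuffle_by_key(filtered, num_partitions)
    right = shuffle_by_key(users, num_partitions)
    # Stage 2: local again, one join per repartitioned partition.
    joined = []
    for lpart, rpart in zip(left, right):
        joined.extend(local_hash_join(lpart, rpart))
    return sorted(joined)

# Hypothetical sample data: (user_id, amount) and (user_id, name),
# split across two partitions with no regard to user_id.
orders = [[(1, 10), (2, -5)], [(3, 7), (1, 4)]]
users = [[(1, "ann"), (3, "cy")], [(2, "bob")]]

print(run_query(orders, users))
```

In this sketch, stage 1 would scale with the number of partitions because no partition reads another's data. The shuffle is where that stops being true.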