by elric
1807 days ago
I work for a company that could greatly benefit from an out-of-the-box distributed stream processing engine (we've been rolling our own for over a decade). At this point, it's pretty much impossible to pick one. All similar Apache tools have similar looking web pages, promising similar benefits, similar use cases, etc. What are the differentiating factors? At which point does it make sense to pick Heron over one of the others? Or vice versa?
I think the most popular options for high-volume, self-hosted distributed stream processing are still Spark, Flink, and Kafka Streams.
Kafka Streams is the simplest, as it is basically just a library on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.
Spark and Flink are similar. Each supports batch processing (on top of Hadoop, for example) as well as stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of processing each event as it arrives, so there will be a bit more latency when doing streaming with Spark, if that matters.
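To make the micro-batch point concrete, here is a toy sketch (plain Python, not Spark or Flink code) of the extra wait an event picks up when it is only handled at the end of the batch interval it arrived in. The function name and the 1-second interval are made up for illustration; real engines add scheduling and processing time on top of this.

```python
def micro_batch_emit_time(arrival_ms: int, batch_interval_ms: int) -> int:
    """Earliest time a micro-batch engine could handle an event:
    the end of the batch interval the event arrived in (toy model)."""
    return (arrival_ms // batch_interval_ms + 1) * batch_interval_ms


# Hypothetical arrival times in milliseconds and a 1-second batch interval.
arrivals_ms = [100, 400, 900, 1000, 1700]
batch_interval_ms = 1000

for t in arrivals_ms:
    wait = micro_batch_emit_time(t, batch_interval_ms) - t
    print(f"event at {t} ms waits {wait} ms for its batch to close")
```

In this model a per-event engine would add no batching wait at all, while the micro-batch wait is anywhere up to one full interval.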
--
Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on any engine that supports it (both Flink and Spark do, as does Google Cloud Dataflow).
Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick for a greenfield project today. Also, these things come with significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real time.
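For anyone unsure what "window aggregations" means here, a minimal single-process sketch of a tumbling-window count in plain Python follows. The function name, event shape, and 10-second window are assumptions for illustration only. It ignores everything that makes this hard at scale (out-of-order events, state that doesn't fit in memory, failures, distribution), which is exactly the part the engines above take care of.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window_start_ms, key) using fixed, non-overlapping windows.

    events: iterable of (timestamp_ms, key) pairs (hypothetical event shape).
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each timestamp falls into exactly one window, identified by its start.
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


# Made-up events: (timestamp in ms, key)
events = [(1_000, "a"), (4_500, "b"), (9_999, "a"), (10_000, "a"), (12_300, "b")]
for (window_start, key), n in sorted(tumbling_window_counts(events, 10_000).items()):
    print(window_start, key, n)
```

If a loop like this over your data is fast enough on one machine, that is a decent sign you may not need a distributed engine yet.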