	by elric 1807 days ago
	I work for a company that could greatly benefit from an out-of-the-box distributed stream processing engine (we've been rolling our own for over a decade). At this point, it's pretty much impossible to pick one. All similar Apache tools have similar looking web pages, promising similar benefits, similar use cases, etc. What are the differentiating factors? At which point does it make sense to pick Heron over one of the others? Or vice versa?

4 comments

pyrophane 1807 days ago

I last looked into this a couple of years ago, so this might be slightly out of date.

I think the most popular options for high-volume, self-hosted distributed stream processing are still Spark, Flink, and Kafka Streams.

Kafka Streams is simpler, as it is basically just a framework on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.

Spark and Flink are similar. Both support batch processing (on top of Hadoop, for example) as well as stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of being truly real-time, so there will be a bit more latency when doing streaming with Spark, if that matters.
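To make the micro-batch latency point concrete, here is a minimal sketch in plain Python. It does not use Spark's or Flink's actual APIs; the function names and the millisecond units are assumptions chosen for illustration. It only models the extra wait an event incurs when results are emitted at the end of a fixed batch interval rather than as each event arrives.

```python
def per_event_delay_ms(arrival_times_ms):
    """Batching delay when each event is handled as soon as it arrives.

    Hypothetical model: processing time is ignored, so the added delay is zero.
    """
    return [0 for _ in arrival_times_ms]


def micro_batch_delay_ms(arrival_times_ms, batch_interval_ms):
    """Batching delay when events are only processed at the end of the
    fixed-size batch interval they arrived in.

    Hypothetical model: batches are aligned to multiples of batch_interval_ms.
    """
    delays = []
    for t in arrival_times_ms:
        # End of the interval that contains arrival time t.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        delays.append(batch_end - t)
    return delays


arrivals = [100, 400, 900, 1000]
print(per_event_delay_ms(arrivals))
print(micro_batch_delay_ms(arrivals, batch_interval_ms=1000))
```

In this toy model the added delay is always between just above zero and one full batch interval, which is the "bit more latency" in question; whether that matters depends on the use case.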

Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on different engines that support it (both Flink and Spark do, as well as Dataflow on Google).

Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick up for a greenfield project today. Also, these things come with some significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real-time.
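As a rough illustration of what a "window aggregation" means here, the following is a minimal single-process sketch in plain Python, not the API of any of the engines mentioned. The function name, the `(timestamp, key)` event shape, and the choice of fixed-size (tumbling) windows are assumptions made for the example. The hard part these engines address is doing this across many machines with late or out-of-order data, which this sketch ignores.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed-size, non-overlapping time windows.

    events: iterable of (timestamp, key) pairs with integer timestamps.
    Returns a dict mapping (window_start, key) to the number of events.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of the window that contains it.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(1, "a"), (3, "a"), (7, "b"), (12, "a")]
print(tumbling_window_counts(events, window_size=10))
```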

emmelaich 1806 days ago

Databricks (which is Apache Spark) if you don't want to do it yourself.

I have no relationship with Databricks.

random314 1807 days ago

Look at Flink or Heron. Other choices like Storm aren't any good.

elric 1807 days ago

While I appreciate your taking the time to make suggestions, these suggestions aren't very useful without context. I'm not saying that it's up to you to provide the context. But I don't know much about Flink or Heron, and looking at their respective websites doesn't tell me whether they'd be a good fit for a specific use case.

At this point, all of these frameworks would probably benefit from a flowchart (or questionnaire tool) that can guide someone towards an informed decision. "Do you need redundancy?" - "Can you afford to lose some messages in situation XYZ?" - "How many events/sec do you want to process?" - "How much hardware can you throw at the problem?" etc.
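A questionnaire like that could start out as something as small as the sketch below. This is purely hypothetical: the function name and the questions are invented for illustration, and the rules only encode the rough opinions voiced elsewhere in this thread (Kafka Streams for simple needs on top of existing Kafka, Flink for sophisticated windowing or lower latency, Beam for portability across engines, Spark otherwise). It is not an authoritative decision procedure, and it does not cover the questions about redundancy, message loss, or throughput.

```python
def suggest_engine(already_uses_kafka, needs_complex_windowing,
                   needs_lowest_latency, wants_engine_portability):
    """Return a starting point for evaluation, based on thread opinions only."""
    if wants_engine_portability:
        # Beam jobs can run on several engines, per the comment above.
        return "Beam"
    if needs_complex_windowing or needs_lowest_latency:
        # Flink was described as stronger on windowing and not micro-batched.
        return "Flink"
    if already_uses_kafka:
        # Kafka Streams was described as the simpler option on top of Kafka.
        return "Kafka Streams"
    # Spark was described as having better tooling.
    return "Spark"


print(suggest_engine(already_uses_kafka=True, needs_complex_windowing=False,
                     needs_lowest_latency=False, wants_engine_portability=False))
```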

random314 1805 days ago

The situation is actually more complex than you are suggesting. A checkbox comparison is not useful. I have worked in considerable depth in the streaming space, and my comments are based on the design docs of both systems. You should read Twitter's Heron paper and the Apache Flink design docs.

For example, Storm or Samza might check all the boxes, but the design of the system is poor enough that the performance will suck. For older versions of Storm, you should be able to write a multithreaded app on a single machine that outperforms a Storm cluster.

himoacs 1807 days ago

I recommend checking out Solace. They have been in business for 20 years. It's not open source, but it packs all the enterprise features you would need.

haik90 1807 days ago

Had a lot of trouble running Solace on VMs and Docker, and spent a lot of time trying to find the root cause of a memory leak (it happened every few months).

I still don't get why they use VPN terms for an event broker.

himoacs 1807 days ago

That's a shame. I work at Solace so shoot me a message and I will show you how to set it up if you are interested.

Solace's first product was hardware appliances, which are still used for high-throughput, low-latency use cases. The concept of a VPN was used to set up isolated virtual brokers so different teams could have their own environments on a shared hardware appliance.

The concept was ported over to software as well and is extremely useful in an enterprise environment. It allows different teams to have their own virtual brokers but not have to pay for or manage multiple brokers.

Nullabillity 1807 days ago

> I work at Solace

That recontextualizes your previous post quite a bit...

_ea1k 1807 days ago

https://xkcd.com/927/
