Hacker News new | ask | show | jobs
by doteka 1807 days ago
How many distributed stream processing engines is the Apache foundation planning to collect? At this point it seems like there’s more projects that do this (if you squint a bit), than companies with a serious usecase for that type of architecture.
6 comments

I work for a company that could greatly benefit from an out-of-the-box distributed stream processing engine (we've been rolling our own for over a decade). At this point, it's pretty much impossible to pick one. All similar Apache tools have similar looking web pages, promising similar benefits, similar use cases, etc. What are the differentiating factors? At which point does it make sense to pick Heron over one of the others? Or vice versa?
I last looked into this a couple of years ago, so this might be slightly out of date.

I think most popular options for high-volume self-hosted distributed stream processing solutions are still Spark, Flink, and Kafka Streams.

Kafka streams is simpler, as it is basically just a framework on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.

Spark and Flink are similar. Both support both batch processing (on top of Hadoop, for example) and stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of being truly real-time, so there will be a bit more latency when doing streaming with Spark, if that matters.

--

Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on different engines that support it (both Flink and Spark do, as well as Dataflow on Google).

Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick up for a greenfield project today. Also, these things come with some significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real-time.

Databricks if you don't want to do it yourself. (which is Apache Spark)

I have no relationship with Databricks.

Look at Flink or Heron. Other choices like storm aren't any good.
While I appreciate your taking the time to make suggestions, these suggestions aren't very useful without context. I'm not saying that it's up to you to provide the context. But I don't know much about Flink or Heron, and looking at their respective websites doesn't tell me whether they'd be a good fit for a specific use case.

At this point, all of these frameworks would probably benefit from a flowchart (or questionnaire tool) that can guide someone towards an informed decision. "Do you need redundancy?" - "Can you afford to lose some messages in situation XYZ?" - "How many events/sec do you want to process?" - "How much hardware can you throw at the problem?" etc.

The situation is actually more complex than you are suggesting. A check box comparison is not useful. I have worked in considerable depth in the streaming space and my comments are based off the design docs of both systems. You should read Twitter's heron paper and Apache Flink design docs.

For eg, storm or samza might check all the boxes, but the design of the system is poor enough that the performance will suck. For older versions of Storm, you should be able to write a multithreaded app on a single machine that outperforms a storm cluster.

I recommend checking out Solace. They have been in business for 20 years. It's not open source though but packs all the enterprise features you would need.
Had a lot trouble running solace on vm and docker, spend a lot time try to find root cause memory leak (happen every few months)

I still don't get why they use VPN terms for event broker

That's a shame. I work at Solace so shoot me a message and I will show you how to set it up if you are interested.

Solace's first product was hardware appliances which are still used for high throughput and low latency usecases. Concept of VPN was used to set up isolated virtual brokers so different teams can have their own environments on a shared hardware appliance.

The concept was ported over to software as well and is extremely useful in an enterprise environment. It allows different teams to have their own virtual brokers but not have to pay for or manage multiple brokers.

> I work at Solace

That recontextualizes your previous post quite a bit...

The long term trend for data access is having a reactive component; stream processing engines allow you do have that.

On the other hand databases have a huge lock-in power (been trying to strangle an Oracle myself for over 5 years now). It is lucrative to be in the database business.

I'd say that every project with a claim of some improvement can, and should try to establish itself in the market; and that having it join the Apache foundation is a great way to get some brand recognition on the cheap.

----

Also, Heron is not that new. It has been developed at Twitter, for replacing Storm IIRC.

But none of them spell out for what use cases they excel over the other options. Since they are all under the ASF banner, it’s impossible to know which is “better” for me. They are all just “great stream processing engines”. But surely they must have diverging properties for given use cases. But none of the pages even attempt to say how they differ. Just “try it out!!”
Years ago, I found a scientific paper that evaluated them all on a fairly detailed rubric. As I recall, it turned out that they're all, in actual fact, crap stream processing engines. You just need to pick the one that's crap at things you don't need to do.

(I could frame it in a more glass half full way, but I find that the pessimistic way of looking at it helps a lot with trimming down options when you have far, far, far too many options.)

Citation needed. And not in the snarky way - I'd really like to read that paper :)
I'm guessing they were referring to "Scalability! But at what COST?"

https://www.usenix.org/system/files/conference/hotos15/hotos...

I've long since lost track of it. It would be 5 or 6 years out of date at this point, anyway, so probably not very useful for informing decisions anymore.
I've just accepted that mindset as a part of the art of engineering. Shifting the inherent crappiness of a domain around to where it doesn't matter as much for your use case.
I started using Apache Storm in 2012 and was really surprised at the number of Apache projects for stream processing back in 2015 or so when I did some research for moving off of storm. I remember Flink, Samza, Spark, Flume and NiFi(?) all being probably suitable options. I remember thinking exactly what you said!

I guess there are enough users to keep the communities alive?

*Enough companies reinventing the wheel to spinoff new Apache distributed stream processing libraries.
It really comes down to the latency and throughput requirements. Projects are architected differently for different latency expectation and different throughput expectation.

In event processing there's a continuum of expected latencies from batch processing to realtime. Batch processing is typically running reports over large volume of events (good for throughput). Hadoop is a good example. On the other end, sub-second realtime report is possible with Heron/Storm. Spark is kind of in the middle with hybrid mini-batching. Reportedly Twitter has used Heron/Storm to track word counts in all the tweets to find trending topics, where the latency between a new tweet coming in to the word counts updated over the whole network is in 100s milliseconds.

Apparently it's another one from Twitter.
At least they had the decency to not open source their Bookkeeper backed pub/sub system when they gave up on it and switched to Kafka...

...then some of the employees involved moved to Yahoo, built it again, and then open sourced it as Pulsar. Then moved onto form a Confluent v2 to sell their Kafka v2 (now with even more Zookeeper!)

I had a problem, so I invented Summingbird, and now I can relatively inexpensively invent new problems on an annual basis.
I wonder this too - I guess enough software engineering teams had similar problems with no solutions and each set about building their own? Not sure how else this proliferation happened.