by scaleout1 3681 days ago
As someone who has used Heron (along with MillWheel, Spark Streaming and Storm), I feel like this announcement is too late. The biggest thing Heron offers is raw scale, but since they decided to keep the existing Storm API, it has the same shitty spout/bolt API that Storm offers. In contrast, Spark Streaming, Flink and Kafka Streams all offer a functional API built on map/flatMap/filter/sink. At Twitter most teams used Summingbird on top of Heron to get the same functional API, but Summingbird didn't get a lot of traction outside Twitter, and I am not sure how actively maintained the OSS version of Summingbird is. Even if you bite the bullet and decide to use SB with Heron, you will still miss out on a lot of use cases, as SB was mostly focused on read/transform/aggregate/write, whereas most streaming problems that I have seen outside of Twitter involve read/transform/aggregate/decision/write. I suppose you can implement decisioning in SB, but I haven't seen it done.
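
To make the API complaint concrete, here is a rough sketch in plain Python of the same word count written both ways. None of this is the real Storm, Heron, Flink or Spark API: `SentenceSpout`, `SplitBolt`, `CountBolt` and `Stream` are made-up names used only to show the shape of the two styles.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List


# --- Spout/bolt style (hypothetical classes, NOT the Storm API) ---
class SentenceSpout:
    """Source component: emits raw sentences one at a time."""

    def __init__(self, sentences: List[str]) -> None:
        self._sentences = sentences

    def emit(self) -> Iterator[str]:
        yield from self._sentences


class SplitBolt:
    """Processing component: splits a sentence into words."""

    def execute(self, sentence: str) -> List[str]:
        return sentence.split()


class CountBolt:
    """Processing component: keeps a running count, skipping short words."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, word: str) -> None:
        if len(word) > 2:
            self.counts[word.lower()] += 1


def run_topology(sentences: List[str]) -> Dict[str, int]:
    """Wire the components together by hand, one tuple at a time."""
    spout, split, count = SentenceSpout(sentences), SplitBolt(), CountBolt()
    for sentence in spout.emit():
        for word in split.execute(sentence):
            count.execute(word)
    return dict(count.counts)


# --- Functional style (hypothetical Stream wrapper) ---
class Stream:
    """Tiny lazy wrapper exposing map/flat_map/filter/sink."""

    def __init__(self, items: Iterable) -> None:
        self._items = items

    def map(self, f: Callable) -> "Stream":
        return Stream(f(x) for x in self._items)

    def flat_map(self, f: Callable) -> "Stream":
        return Stream(y for x in self._items for y in f(x))

    def filter(self, p: Callable) -> "Stream":
        return Stream(x for x in self._items if p(x))

    def sink(self, f: Callable):
        # Terminal step: hand the (lazy) items to the consumer.
        return f(self._items)


def run_pipeline(sentences: List[str]) -> Dict[str, int]:
    """Same job expressed as one chained pipeline."""
    return (
        Stream(sentences)
        .flat_map(str.split)
        .filter(lambda w: len(w) > 2)
        .map(str.lower)
        .sink(lambda words: dict(Counter(words)))
    )
```

Both functions compute the same counts; the difference is that the second one reads as a description of the data flow, while the first makes you write and wire a class per step.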

Comparing Heron to Google MillWheel is interesting because of the design choices each made. Heron supports at-least-once and at-most-once message guarantees, but at Twitter most jobs ran with acking turned off, so it was at-most-once with acknowledged data loss (they had a batch job doing mop-up work to pick up missing data). Google, on the other hand, implemented exactly-once semantics through idempotent sinks, watermarking, handling of out-of-order messages and deduping support. Since both Flink and Spark will be implementing the Apache Beam model (MillWheel's successor), the only reason I see for someone picking Heron instead of Flink/Spark is that they are operating at a massive scale that Flink/Spark don't support yet.
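
For anyone wondering why deduping matters here, a minimal sketch in plain Python, assuming each message carries a unique ID. This is not how MillWheel or Heron is actually implemented; `NaiveSink`, `DedupingSink` and `deliver_at_least_once` are hypothetical names. It only shows how at-least-once delivery plus a deduplicating sink gives the same totals as exactly-once would.

```python
from typing import Dict, Iterable, Set, Tuple

# (message_id, key, amount) -- the unique message_id is an assumption
Message = Tuple[str, str, int]


class NaiveSink:
    """Adds every message it sees, so replays are double counted."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}

    def write(self, msg: Message) -> None:
        _, key, amount = msg
        self.totals[key] = self.totals.get(key, 0) + amount


class DedupingSink:
    """Remembers message IDs and ignores anything it has already applied."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}
        self._seen: Set[str] = set()

    def write(self, msg: Message) -> None:
        msg_id, key, amount = msg
        if msg_id in self._seen:
            return  # replayed message: applying it again would double count
        self._seen.add(msg_id)
        self.totals[key] = self.totals.get(key, 0) + amount


def deliver_at_least_once(
    messages: Iterable[Message], sink, redeliver_ids: Set[str]
) -> None:
    """Simulate at-least-once delivery: some messages arrive twice."""
    for msg in messages:
        sink.write(msg)
        if msg[0] in redeliver_ids:
            sink.write(msg)  # simulated replay after a lost ack


messages = [("m1", "a", 1), ("m2", "a", 2), ("m3", "b", 5)]

naive = NaiveSink()
deliver_at_least_once(messages, naive, {"m2"})

dedup = DedupingSink()
deliver_at_least_once(messages, dedup, {"m2"})
```

Here `naive.totals` over-counts key `a` because `m2` is applied twice, while `dedup.totals` matches what a single delivery would have produced. The seen-ID set grows without bound in this sketch; a real system would have to expire old IDs somehow.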

4 comments

Storm is a low-level system for managing (optionally) transactional multi-machine tasks. It makes no assumptions about what is being processed (e.g. analytics, data transforms). The primitives you are talking about exist in the child project Trident, which runs on top of Storm. Storm itself is no more for analytics than a web server is. It is a lower-level tool.
The parent also ignored the difference in time-to-process, which is drastically lower in Storm. It has its flaws, but scale is not the only metric to use as a decider.
> they are operating at massive scale that flink/spark dont support yet

Flink certainly scales just fine, for what it's worth. Flink 1.0 is quite good, and I'd consider what I'm doing "massive scale"; the ease of 1MM+ QPS with decent p95 latency via Flink surprised me compared to other systems that I investigated in this space. Most hip-fired benchmarks, including that awful Yahoo! one that everybody cites, use Flink poorly.

Rest of your comment is great and I couldn't agree more. Spot-on analysis. Twitter made a misfire here: buying out Nathan Marz, neglecting Storm in favor of Heron while the rest of the field advanced (notably Google's open source work and Flink), announcing Heron, which is so much better, but keeping it close to the chest for a while, then losing out on both of their streaming engines in time. Storm and Heron both feel too little too late, particularly Storm's recent (vast) performance improvements, which a lot of folks I know kinda shrugged at, and that's kinda too bad.

The Dataflow/Beam/Flink stuff is the compelling horse right now, to me. Just my personal opinion.

> Most hip-fired benchmarks, including that awful Yahoo! one that everybody cites, use Flink poorly

Why are Yahoo! benchmarks awful? How did they manage to use Flink poorly?

For people who don't know what he is referring to, check this: https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...

I had a look at the open source Summingbird as a possible way to implement a (soft) real-time project I have, because I'm not especially Java-ish and Storm does not seem to play that nicely with Scala, but it looked somewhat stale. (I've been told Storm works decently with Clojure, though, which might have been a solution to my non-Javaness.)

Ditched it and decided to do it in Spark with Scala (making it a good excuse to learn Scala). With so many real time options popping up and around, deciding which to pick is getting harder and harder.

Thank you for this comment. It contained orders of magnitude more useful information about the API choices and data models that define this system than the linked article itself.