by lobster_johnson 3276 days ago
This is very useful.

I wish it had some information about supported languages. Most of the processing systems are JVM-based and require that you write your program in a JVM language. Some have Python support. But I have yet to encounter one that allows you to write your pipelines in Go, Rust or JavaScript, for example. One notable exception is Storm, which supports pluggable runners, including one that talks to an external program over standard I/O. My impression is that, aside from Python, today's pipelines require a large amount of JVM buy-in, something I'm personally not interested in.

I'd also love some kind of metric for "aliveness". For example, my impression is that Storm was hot for about a week, and then Spark and Flink happened, and now nobody is talking about it, and Twitter itself has apparently replaced it with Heron.

4 comments

Storm is very much alive. Many of its users are simply running it reliably in production now. At my company, we are well past our trillionth production tuple running through Storm.

Also note that unlike Spark, Storm is a pure open source project that does not have a major commercial entity marketing its use cases. Hortonworks has put a little marketing effort behind it, but otherwise, it's just a mature & active Apache infrastructure project. Storm 2.0 is coming out soon and features a slew of performance- and reliability-improving enhancements.

But as for marketing buzz, Google has commercial reasons for you to use Beam and Dataflow, for example. And likewise Databricks for Spark.

It's probably a good idea to pick production large-scale data infrastructure on a metric other than recency of marketing buzz.

-$0.02 from one of the original authors of streamparse, the Python API for Storm

Thanks, that's helpful. Is building a pipeline with Java, consisting entirely of shell spouts, a viable option? Are there downsides to not using the Java API?
I agree. Though the Storm IRC channel is a bit of a ghost town, the Google group is fairly active.
If you're looking for something that doesn't constrain you to a particular language, take a look at Pachyderm. It's built around containers so you can run any code you want. I designed it with JVM-phobes like you (and me) in mind.

https://github.com/pachyderm/pachyderm

Cool, thanks! I haven't looked closely at it, but the "version controlled" part is something I don't need/want. Does it get in the way at all? I'm mostly looking for incremental, semi-real-time streaming processing, not something where you shoot off a big job on a dataset and get back results.
The version control semantics of the system are pretty crucial for some of the features you describe wanting. Pachyderm supports incremental operations on stream-like datasets. But what's going on under the hood is that the dataset is being version controlled, and thus the system can tell which data has changed and only process the new data. Hope that helps. I'd be happy to chat more about your specific use case. Shoot me an email at jdoliner@pachyderm.io
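To make the "only process the new data" idea concrete, here is a toy sketch of change detection between two snapshots of a dataset. This is not Pachyderm's actual implementation; the function names (`fingerprint`, `changed_paths`) and the choice of SHA-256 content hashes are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(content: bytes) -> str:
    """Hypothetical helper: identify a file version by its content hash."""
    return hashlib.sha256(content).hexdigest()


def changed_paths(
    previous: Dict[str, str], files: Dict[str, bytes]
) -> Tuple[List[str], Dict[str, str]]:
    """Compare the current files against the previous snapshot.

    Returns the sorted paths that are new or modified, plus the new
    snapshot (path -> hash) to keep for the next run.
    """
    snapshot = {path: fingerprint(data) for path, data in files.items()}
    changed = sorted(
        path for path, digest in snapshot.items() if previous.get(path) != digest
    )
    return changed, snapshot
```

A pipeline built this way would keep the returned snapshot after each run and feed only the changed paths to the processing step, so unchanged data is never reprocessed.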
You can also use Spark's pipe to call external programs.
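For context, a minimal sketch of the kind of external program such a pipe could call, assuming the usual convention that each record arrives as one line on standard input and each line written to standard output becomes an output record. The names `transform`, `process_lines` and `main`, and the uppercase transform itself, are hypothetical. It's in Python for brevity, but any language that can read stdin and write stdout would fit the same shape.

```python
import sys
from typing import Iterable, Iterator


def transform(record: str) -> str:
    """Hypothetical per-record logic; replace with real processing."""
    return record.upper()


def process_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one transformed record per non-empty input line."""
    for line in lines:
        record = line.rstrip("\n")
        if record:  # skip blank lines rather than emit empty records
            yield transform(record)


def main() -> None:
    """Stream records from stdin to stdout, one per line."""
    for result in process_lines(sys.stdin):
        sys.stdout.write(result + "\n")
    sys.stdout.flush()

# A real script would end by calling main().
```

Keeping the record logic in `process_lines` means it can be tested without a cluster; the framework only ever sees the stdin/stdout loop in `main`.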
Ergonomically speaking, how practical is that? Do you get all the benefits and performance of an equivalent Java pipeline?
Do you see any reason Golang would be less suited for those tasks?