by SpaceManNabs 2877 days ago
Hey, I am a huge user of Apache Flink in personal and work projects. I skimmed through your README but still have the following questions.

What are the advantages of using Faust, feature-wise or otherwise? Are you guys planning on having feature parity with Flink (triggers, evictors, a process function equivalent, etc.)? I can definitely see its use in having better compatibility with ML/AI environments and the other tools in the Python toolkit. But specifically regarding ML/AI, I can usually export those things into the JVM. And with regard to Flask/Django, I can use Scala http4s.

Sorry if that seemed like a lot! Trust me, I am very excited about this project!


You are the first Flink user I have come across! Could you briefly explain why it is so compelling over its competitors?

When it comes to batch, I still think Apache Spark and the standard scientific computing stack in Python are king.

But when it comes to streaming, I don't think Spark Structured Streaming has come very close to Flink, although Spark 2.3.0 might be sufficient for your use cases. I use Spark Structured Streaming only when I have to deploy a Spark pipeline/ML.

But, besides the features I listed, I think Flink is just better written for streaming.

For example, defining checkpoints and state management in Flink is so much more expressive and easier to write (and perhaps more performant from what I saw at Flink Forward).

Flink Async is amazing!!!!

The Flink Table/SQL API is not as nice as Spark SQL for batch right now, but it is getting there.

Flink also seems to perform better than Spark Structured Streaming when you turn on object reuse, at least according to Alibaba and Netflix.

And then, the features I listed are amazing. Evictors and triggers are super helpful. Process Functions let me write custom aggregations without a lot of mental overhead.
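
For anyone who has not used triggers and evictors, here is a rough plain-Python toy of the idea. It is not Flink's API: `CountWindow`, `trigger_every`, and `keep_last` are made-up names for illustration. The trigger decides when the window fires, and the evictor drops old elements before the aggregation runs.

```python
class CountWindow:
    """Toy window with a count-based trigger and a count-based evictor.

    Hypothetical illustration only; none of these names come from Flink.
    """

    def __init__(self, trigger_every, keep_last, aggregate):
        if trigger_every < 1 or keep_last < 1:
            raise ValueError("trigger_every and keep_last must be >= 1")
        self.trigger_every = trigger_every  # fire after this many new elements
        self.keep_last = keep_last          # evictor keeps only this many
        self.aggregate = aggregate          # function applied when firing
        self.buffer = []
        self.since_fire = 0

    def add(self, element):
        """Buffer one element; return the aggregate if the trigger fires."""
        self.buffer.append(element)
        self.since_fire += 1
        if self.since_fire < self.trigger_every:
            return None  # trigger says "keep collecting"
        self.since_fire = 0
        # Evictor: drop the oldest elements before aggregating.
        self.buffer = self.buffer[-self.keep_last:]
        return self.aggregate(self.buffer)


window = CountWindow(trigger_every=2, keep_last=3, aggregate=sum)
results = [window.add(x) for x in [1, 2, 3, 4, 5, 6]]
print(results)
```

With these settings the window fires on every second element and sums at most the last three, which is roughly a sliding count window.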

Time Characteristics are really nice in Flink.
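
To show why the choice of time matters, here is a small plain-Python sketch (again not Flink code; `tumbling_counts` and the sample timestamps are invented for illustration). It buckets the same events into tumbling windows twice, once by the timestamp carried in the event and once by arrival time, and it deliberately ignores watermarks and late-data handling.

```python
from collections import defaultdict


def tumbling_counts(events, window_ms, use_event_time):
    """Count events per tumbling window.

    events: iterable of (event_ts_ms, arrival_ts_ms, value) tuples.
    Returns {window_start_ms: count}.
    """
    counts = defaultdict(int)
    for event_ts, arrival_ts, _value in events:
        ts = event_ts if use_event_time else arrival_ts
        window_start = ts - (ts % window_ms)  # align to window boundary
        counts[window_start] += 1
    return dict(counts)


# Hypothetical data: "b" was produced at 1900 ms but arrived late, at 2300 ms.
events = [(1000, 1100, "a"), (1900, 2300, "b"), (2100, 2200, "c")]

by_event_time = tumbling_counts(events, window_ms=1000, use_event_time=True)
by_arrival_time = tumbling_counts(events, window_ms=1000, use_event_time=False)
print(by_event_time)
print(by_arrival_time)
```

The late event "b" lands in the window where it actually happened under event time, but in the following window under arrival (processing) time, so the per-window counts differ.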

Flink has a lot of nice connectors and stuff written already (although Parquet files are troublesome if they come from Spark, because of the way Spark handles its read and write support classes). By contrast, in Spark Structured Streaming you have to write your own Kinesis connector (or use the one put out by Databricks on their blog).

Sorry if this seems all over the place. I just posted what came to mind. And I should add, Databricks is doing a lot of work to make Spark Structured Streaming really comparable, so who knows what stuff looks like in 2-3 years. Like I said above, I use Spark Structured Streaming (2.3.0+) for some very specific ML stuff that my Flink environment cannot handle nicely yet.

My only complaint is that there isn't a lot of community-written code out there, but the Data Artisans team is super active, helpful, and nice.

Thanks, this is super helpful and exactly what I am looking for. My main prospective use of Flink is as a target for Apache Beam. GCP Dataflow is great, but I want to have portable jobs, and Flink looks like the best target (over Spark).

I have never used Beam before, and I am unsure if it uses all the features of Flink. The documentation also seems a bit sparse (at a quick glance I couldn't find how to tune checkpoints, for example). But if you know how to tune it to use more of Flink, or it works for your use case (it supports Python and Go), then go for it. It looks pretty neat.