Hacker News
by dmitriid 1981 days ago
I can kinda agree on GCP with one exception: Dataflow. I have no idea what the future holds for it.

It is a managed Apache Beam service and is very useful for certain scenarios (like "hey, we have a million incoming PubSub messages that we need to transform into a dozen different branching streams of data"). It looks like even BigQuery actually transforms SQL statements into a bunch of Dataflow jobs.

But...

- Minor version updates to Google Dataflow SDK once every couple of months while deprecating most other minor versions? Check.

- No visible contributions to Apache Beam itself? Check. In 2021 I still don't know if I can use any Java versions beyond Java 8 to develop for and run in Dataflow. And Google is arguably one of the biggest users of Apache Beam, and definitely the user with the largest pile of money to throw at the problem.

- They've recently sent out a questionnaire about Dataflow to some of their customers that feels like a "hey, we're definitely considering deprecating this, we're gauging the potential impact".

4 comments

Disclosure: I work on Google Cloud (and with the Dataflow folks on occasion).

Sorry if you're getting mixed messages. Dataflow is here to stay. Google, Spotify, Twitter, and many other large customers heavily depend on it. Twitter moved their entire ad revenue pipeline to it [1] last year.

A quick perusal of https://github.com/apache/beam/commits/master shows decent Googler activity, though. Can you highlight where you were looking for "no visible contributions"? (Maybe we do a bad job of being visible?)

[1] https://cloud.google.com/blog/products/data-analytics/modern...

Interesting comment, definitely want to hear more. I have concerns about Beam/Dataflow, but they seem different to yours.

The Dataflow product seems to run older versions of Apache Beam just fine, so minor deprecations don’t seem like an issue in practice, but maybe I’m mistaken.

“No visible contributions to Apache Beam itself”. I don’t think this is true. I’m a contributor and somewhat active on the developer mailing list, and it seems the majority of the contributions these days come from Google employees.

If the questionnaire you’re referring to was the paid Apache Beam survey, I participated and definitely didn’t get the impression that they were considering deprecating the service. It was much more focused on how they can improve docs, examples, and help developers use it.

Now, I think the project is too ambitious even for Google. They don’t need to support Spark/Dataflow/Flink in three different languages (Java/Python/Go) imo. I’m also frustrated with some of the bugs that slip through.

The fact that there is no back pressure support in a streaming framework is such a Google thing to do: why worry about back pressure if you can just tell another team to increase their throughput for downstream sinks? /s

Dataflow does seem to be one of GCP’s most popular services (Spotify and Twitter are both users now), so I would guess it is here to stay in some form.

And GCP's Director of Outbound Product Management saying things like, "I’ve been thinking about the cool ways @GCPcloud reinvented public cloud... Sometimes you have to leave the past behind, and we haven’t hesitated to re:tire services and features. HIYOOOOO! We’re getting better though :)" [0] doesn't really inspire any confidence either.

[0] https://archive.is/l6s5Q

I love Dataflow, but share the same concerns. It feels like something that's going to die of neglect even if it's not intentional.