A Scala API for Google Cloud Dataflow (github.com)
96 points by Mullefa 3739 days ago
5 comments

It's interesting that Cloudera went the opposite way from Spotify: it fitted Google's Java API on top of Spark, changing the backend instead of the "frontend" [1].

[1] https://github.com/cloudera/spark-dataflow

Scio author here.

A bit of background: Spark and Flink are both frameworks with their own execution engines. Scalding is tightly coupled to Cascading + Hadoop as its execution engine (Tez support is also WIP). The Dataflow Java SDK/Apache BEAM, on the other hand, is designed to be a simple abstraction with pluggable engines, and the Cloud Dataflow service is just one of many possible runners.

Right now there are:

- local runner

- Dataflow runner, fully managed service in GCP

- Spark runner

- Flink runner

Scio wraps the Dataflow Java SDK (Apache BEAM) and can potentially leverage any available runner.
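
To make the "simple abstraction with pluggable engines" idea concrete, here is a minimal sketch in plain Scala. Every name in it (`Pipeline`, `Runner`, `LocalRunner`, `ChunkedRunner`, `Demo`) is hypothetical and invented for illustration; none of this is the Dataflow Java SDK, BEAM, or Scio API. It only shows the general shape: the pipeline is a description, and the runner decides how to execute it.

```scala
// Hypothetical names throughout: this is NOT the Dataflow/BEAM or Scio API.

// A pipeline is only a description of work: a source plus a list of transforms.
final case class Pipeline(source: Seq[String], steps: List[Seq[String] => Seq[String]]) {
  def map(f: String => String): Pipeline =
    copy(steps = steps :+ ((xs: Seq[String]) => xs.map(f)))

  def filter(p: String => Boolean): Pipeline =
    copy(steps = steps :+ ((xs: Seq[String]) => xs.filter(p)))
}

// A runner decides how and where that description is executed.
trait Runner {
  def name: String
  def run(p: Pipeline): Seq[String]
}

// Applies every step in-process over the whole input, in the spirit of a local runner.
object LocalRunner extends Runner {
  val name = "local"
  def run(p: Pipeline): Seq[String] =
    p.steps.foldLeft(p.source)((data, step) => step(data))
}

// Stand-in for a different engine: processes the input in chunks of two elements.
// This is only safe because the steps here are per-element transforms.
object ChunkedRunner extends Runner {
  val name = "chunked"
  def run(p: Pipeline): Seq[String] =
    p.source.grouped(2).toSeq.flatMap { chunk =>
      p.steps.foldLeft(chunk)((data, step) => step(data))
    }
}

object Demo {
  // The same pipeline definition is handed to both runners unchanged.
  val pipeline: Pipeline =
    Pipeline(Seq("spark", "flink", "dataflow", "local"), Nil)
      .filter(_.contains("l"))
      .map(_.toUpperCase)

  def main(args: Array[String]): Unit =
    Seq[Runner](LocalRunner, ChunkedRunner).foreach { r =>
      println(s"${r.name}: ${r.run(pipeline)}")
    }
}
```

The point of the sketch is that `Demo.pipeline` never mentions an engine, so swapping the runner does not touch the pipeline code. A wrapper sitting on top of such an abstraction gets that property for free, which is the sense in which it can leverage any available runner.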

Interesting project, glad to see more and more organisations using Scala for data projects.
scio is also the name for a portable molecular sensor: https://www.consumerphysics.com/myscio/scio
this is cool! thanks for mentioning it, i may have to grab myself a unit or two.
this is the best non-thread related thread-related post ever. very cool!!
Is this native? Or just a Scala wrapper?
I would expect it to be as native as Google's own Java API [1], though it is still just the API, not the actual backend.

[1] https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Correct, it's a thin Scala wrapper with some additional features. Execution is delegated to Dataflow/BEAM.
Any plan to port it to Beam?
Scio author here. Yes, as soon as BEAM finishes bootstrapping.