by samuell
3739 days ago
It's a bit interesting that Cloudera went the opposite way from Spotify and fitted Google's Java API on top of Spark instead (so, changed the backend instead of the "frontend") [1].
[1] https://github.com/cloudera/spark-dataflow
A bit of background: Spark and Flink are both frameworks with their own execution engines. Scalding is tightly coupled to Cascading + Hadoop as its execution engine (Tez support is also WIP). The Dataflow Java SDK/Apache Beam, on the other hand, is designed to be a simple abstraction with pluggable engines, and the Cloud Dataflow service is just one of the many possible runners.
Right now there are:
- local runner
- Dataflow runner, fully managed service in GCP
- Spark runner
- Flink runner
Scio wraps the Dataflow Java SDK (Apache Beam) and can potentially leverage any available runner.
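
To make the "pluggable engines" idea concrete, here is a toy sketch in plain Python. It is not the Dataflow/Beam or Scio API: `Pipeline`, `Runner`, `LocalRunner` and `ThreadedRunner` are hypothetical names made up for illustration. The only point is that the pipeline is described once and the runner decides how it gets executed.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Engine-agnostic job description: an ordered list of per-element steps."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self


def apply_steps(steps, element):
    """Run one element through all steps; return [] if a filter drops it."""
    for kind, fn in steps:
        if kind == "map":
            element = fn(element)
        elif not fn(element):
            return []
    return [element]


class Runner(ABC):
    """Hypothetical runner interface: each engine executes the same pipeline."""

    @abstractmethod
    def run(self, pipeline, data):
        ...


class LocalRunner(Runner):
    """Executes sequentially in the current thread."""

    def run(self, pipeline, data):
        out = []
        for element in data:
            out.extend(apply_steps(pipeline.steps, element))
        return out


class ThreadedRunner(Runner):
    """Executes on a thread pool, standing in for a 'different engine'."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, pipeline, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map yields results in input order
            chunks = pool.map(lambda e: apply_steps(pipeline.steps, e), data)
            return [item for chunk in chunks for item in chunk]


# The pipeline is defined once and handed to whichever runner is available.
pipeline = Pipeline().map(str.lower).filter(lambda w: len(w) > 3)
words = ["Spark", "Flink", "Tez", "BEAM"]
local = LocalRunner().run(pipeline, words)
threaded = ThreadedRunner().run(pipeline, words)
```

In the real SDK the runner is picked through pipeline options rather than by instantiating a class like this, but the separation is the same: swapping the runner does not touch the pipeline code.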