Hacker News
by caravel 3394 days ago
In a modern data team, Spark is just one of the types of jobs you may want to orchestrate. Typically, as your company gets more tangled in data processing, you'll have many storage and compute engines to orchestrate: Hive, MySQL, Presto, HBase, map/reduce, Cascading/Scalding, scripts, external integrations, R, Druid, Redshift, microservices, ...

Airflow allows you to orchestrate all of this and keep most of the code and high-level operations in one place.

Of course Spark has its own internal DAG and can somewhat act as Airflow and trigger some of these other things, but typically that breaks down as you have a growing array of Spark jobs and want to keep a holistic view.
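
To make the "holistic view" point concrete, here is a rough plain-Python sketch of one dependency graph spanning several engines. It is not Airflow's API: it only uses the standard library's `graphlib`, and the task names, the `PIPELINE` dict, and the `execution_order` and `run` helpers are all hypothetical, made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task records the engine it would run on
# and the tasks that must finish before it starts.
PIPELINE = {
    "extract_mysql": {"engine": "MySQL", "upstream": []},
    "aggregate_hive": {"engine": "Hive", "upstream": ["extract_mysql"]},
    "train_spark": {"engine": "Spark", "upstream": ["aggregate_hive"]},
    "load_redshift": {"engine": "Redshift", "upstream": ["aggregate_hive"]},
    "report_r": {"engine": "R", "upstream": ["train_spark", "load_redshift"]},
}


def execution_order(pipeline):
    """Return task names in an order that respects every upstream dependency."""
    graph = {name: set(spec["upstream"]) for name, spec in pipeline.items()}
    return list(TopologicalSorter(graph).static_order())


def run(pipeline):
    """Pretend to run each task; a real orchestrator would dispatch to the engine."""
    log = []
    for name in execution_order(pipeline):
        log.append(f"{pipeline[name]['engine']}: {name}")
    return log


if __name__ == "__main__":
    for line in run(PIPELINE):
        print(line)
```

The Spark job is just one node here. Its own internal DAG stays inside that node, while the outer graph is the single place that knows what runs before and after it.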

1 comment

That is an incredibly lucid answer. That should be the first line on the Airflow project.