We only use KubernetesOperators, but this has many downsides, and it's very clearly an afterthought in the Airflow project. It creates confusion because users of Airflow expect features A, B, and C, and with KubernetesOperators those features don't work because your biz logic lives outside Airflow. E.g., if only your biz logic, running in an external task, knows which S3 bucket it talks to, how can Airflow know? So its Dataset feature is now useless.
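To make that concrete, here's a rough sketch in plain Python. It is not real Airflow code: `PodTask` and `visible_datasets` are made-up names, and the image and bucket are placeholders. The point is only that a scheduler which sees "run this container" learns nothing about S3 unless someone copies the URI into the task definition by hand.

```python
from dataclasses import dataclass, field


@dataclass
class PodTask:
    """Hypothetical stand-in for a task that only launches a container."""
    task_id: str
    image: str
    arguments: list
    # Dataset URIs the orchestrator knows about; these must be declared by hand.
    outlets: list = field(default_factory=list)


def visible_datasets(tasks):
    """Return the dataset URIs an orchestrator can see from task metadata alone."""
    return sorted({uri for task in tasks for uri in task.outlets})


# The container writes to S3, but that fact lives inside the image.
opaque = PodTask("spark_job", "example/spark-job:latest", ["--date", "2024-01-01"])

# Only a hand-maintained copy of the URI makes the dataset visible.
declared = PodTask(
    "spark_job_declared",
    "example/spark-job:latest",
    ["--date", "2024-01-01"],
    outlets=["s3://example-bucket/output/"],
)

print(visible_datasets([opaque]))    # []
print(visible_datasets([declared]))  # ['s3://example-bucket/output/']
```

The second variant works, but now the same bucket name is maintained in two places: inside the image and in the DAG.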

There are a number of blog posts echoing a similar critique[1].

Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole overkill for just monitoring external tasks. At that point, you should have put your orchestration in client code to begin with; many other frameworks get this division between client and server right. That would also make it easier to support multiple languages.
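A toy illustration of what I mean by orchestration in client code, using only the standard library. The task functions and the `WORKFLOW` / `run_locally` names are invented for this example and don't come from any particular framework. The workflow is plain functions plus a dependency map, so the same definition can run in-process in a test or in CI, and a server would only need to schedule and observe it.

```python
from graphlib import TopologicalSorter


def extract():
    return [1, 2, 3]


def transform(rows):
    return [r * 10 for r in rows]


def load(rows):
    return sum(rows)


# Task name -> (function, names of upstream tasks whose results it consumes).
WORKFLOW = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_locally(workflow):
    """Run every task in dependency order inside the current process."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_locally(WORKFLOW)
print(results["load"])  # 60
```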

According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...

> Airflow has a lot of dependencies - direct and transitive

> The important dependencies are: SQLAlchemy, Alembic, Flask, werkzeug, celery, kubernetes

Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?

[1] Anecdotes from various posts:
- https://medium.com/bluecore-engineering/were-all-using-airfl...
- https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-co...
- https://dagster.io/blog/dagster-airflow

> Airflow, in its design, made the incorrect abstraction by having Operators actually implement functional work instead of spinning up developer work.

> By simply moving to using a Kubernetes Operator, Airflow developers can develop more quickly, debug more confidently, and not worry about conflicting package requirements.

> Airflow lacks proper library isolation. It becomes hard or impossible to do if any team requires a specific library version for a given workflow

> There is no way to separate DAGs to development, staging, and production using out-of-the-box Airflow features. That makes Airflow harder to use for mission-critical applications that require proper testing and the ability to roll back

> Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nigh-impossible to run it locally or as part of CI. And it requires opting out of all of the integrations that come out-of-the-box with Airflow.