by nyrikki
670 days ago
Only because you have chosen to introduce configuration and maintenance complexity by using Airflow as enterprise-wide middleware. In a modern event-based SOA, products like Airflow are a sometimes food, while pub/sub is the default. Perhaps a search for images of the Zachman Framework would help conceptualize how you are tightly coupling to the implementation. Also research SOA 2.0, or event-based SOA; the Enterprise Service Bus concept of the original SOA is as dead as CORBA. ETA: the minimal package load for Airflow isn't bad. Are you installing all of the plugins and their dependencies?
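To make the pub/sub point concrete, here is a minimal in-process sketch of the decoupling it buys you. Everything in it is hypothetical: `EventBus`, the `file.landed` topic, and the example bucket path are illustrative names, and a real deployment would presumably sit on an actual broker rather than a Python dict.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Hypothetical in-process stand-in for a real message broker."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every handler on the topic; return how many ran."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
loaded: list = []
audited: list = []

# Two independent consumers; the producer knows about neither of them.
bus.subscribe("file.landed", lambda e: loaded.append(e["path"]))
bus.subscribe("file.landed", lambda e: audited.append(("seen", e["path"])))

# The producer only announces that something happened (path is made up).
delivered = bus.publish("file.landed", {"path": "s3://example-bucket/part-0.parquet"})
```

Adding a third consumer means one more `subscribe` call and no change to the producer, which is roughly the opposite of wiring every step into a central DAG definition.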
There are a number of blog posts echoing a similar critique[1].
Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole overkill for what is left: monitoring external tasks. At that point, you should have just had your orchestration in client code to begin with, and many other frameworks made this correct division between client and server. That would also make it easier to support multiple languages.
According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...
> Airflow has a lot of dependencies - direct and transitive
> The important dependencies are: SQLAlchemy, Alembic, Flask, werkzeug, celery, kubernetes
Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?
[1] Anecdotes from various posts
- https://medium.com/bluecore-engineering/were-all-using-airfl...
- https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-co...
- https://dagster.io/blog/dagster-airflow
> Airflow, in its design, made the incorrect abstraction by having Operators actually implement functional work instead of spinning up developer work.
> By simply moving to using a Kubernetes Operator, Airflow developers can develop more quickly, debug more confidently, and not worry about conflicting package requirements.
> Airflow lacks proper library isolation. It becomes hard or impossible to do if any team requires a specific library version for a given workflow
> There is no way to separate DAGs to development, staging, and production using out-of-the-box Airflow features. That makes Airflow harder to use for mission-critical applications that require proper testing and the ability to roll back
> Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nigh-impossible to run it locally or as part of CI. And it requires opting out of all of the integrations that come out-of-the-box with Airflow.
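One way to read the environment-coupling complaint in that last quote: if the business logic is a plain function and the scheduler only calls a thin entry point, the same code can run locally or in CI. The following is a rough sketch under that assumption, not how any of the quoted posts do it. `dedupe_events` and `run_task` are hypothetical names, and the in-memory fakes stand in for whatever S3 or Spark I/O the real task would use.

```python
from typing import Callable, Iterable, List


def dedupe_events(rows: Iterable[dict]) -> List[dict]:
    """Pure business logic: keep the first row seen for each event id."""
    seen = set()
    out = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out


def run_task(read: Callable[[], Iterable[dict]], write: Callable[[List[dict]], None]) -> int:
    """Thin entry point; a scheduler or container would pass in real I/O here."""
    result = dedupe_events(read())
    write(result)
    return len(result)


# Local or CI run: in-memory fakes replace the real storage, no scheduler needed.
sink: List[dict] = []
count = run_task(
    read=lambda: [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}],
    write=sink.extend,
)
```

Nothing here imports an orchestrator, so whether the entry point ends up invoked from a pod, a cron job, or a test runner is a deployment decision rather than something baked into the task.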