by 8589934591 1485 days ago

I echo other comments. Running and managing Airflow beyond simple jobs is complicated. But then if you are running and managing Airflow for simpler jobs, then you might not need Airflow.

One data center company that I know of uses Airflow at scale with Docker and k8s. They have a huge devops team just to manage the orchestrator, and that team in turn has to fine-tune the orchestrator to run smoothly and efficiently. Similar to what Shopify has noted here, they have built on top of and extended Airflow to take care of pain points like point 4. For companies like this it makes sense to run Airflow.

Another issue I see with companies/engineers who adopt Airflow is that they use it as a substitute for a script rather than as an orchestrator. For example, say you want to download files from an API, upload them to S3, load them into your warehouse (say Snowflake) and do some transformations to get your final table. Instead of writing separate scripts for each step of fetch/upload/ingest/transform and calling each step from the DAG, they end up writing everything as tasks in a DAG. A huge disadvantage is that there is a lot of code duplication. If you had each script as a CLI, all your DAG/task would have to do is call the script with the respective args. I agree that Airflow comes with a lot of convenience wrappers to create tasks for many things, but I feel this results in losing flexibility.
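To make the "script as a CLI" idea concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the `pipeline` program name, the step functions, their flags, and the placeholder bodies (which only return a description of what a real step would do). The point is just that each step is reachable as a plain command, so an orchestrator task only has to pass args.

```python
import argparse
import sys


# Hypothetical step implementations: real ones would call the API, S3,
# the warehouse, etc. Here they only describe what they would do.
def fetch(source_url: str, out_dir: str) -> str:
    return f"fetched {source_url} into {out_dir}"


def upload(in_dir: str, bucket: str) -> str:
    return f"uploaded {in_dir} to {bucket}"


def ingest(bucket: str, table: str) -> str:
    return f"ingested {bucket} into {table}"


def transform(table: str) -> str:
    return f"transformed {table}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="step", required=True)

    p = sub.add_parser("fetch")
    p.add_argument("--source-url", required=True)
    p.add_argument("--out-dir", required=True)
    p.set_defaults(run=lambda a: fetch(a.source_url, a.out_dir))

    p = sub.add_parser("upload")
    p.add_argument("--in-dir", required=True)
    p.add_argument("--bucket", required=True)
    p.set_defaults(run=lambda a: upload(a.in_dir, a.bucket))

    p = sub.add_parser("ingest")
    p.add_argument("--bucket", required=True)
    p.add_argument("--table", required=True)
    p.set_defaults(run=lambda a: ingest(a.bucket, a.table))

    p = sub.add_parser("transform")
    p.add_argument("--table", required=True)
    p.set_defaults(run=lambda a: transform(a.table))

    return parser


def main(argv=None) -> str:
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        # No step given: show the available steps instead of failing.
        build_parser().print_help()
        return ""
    args = build_parser().parse_args(argv)
    result = args.run(args)  # dispatch to the selected step
    print(result)
    return result


if __name__ == "__main__":
    main()
```

With something like this, the same `pipeline fetch --source-url ... --out-dir ...` command works from a terminal, from cron, or from whatever orchestrator task wraps it.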

This also results in them tying their workflow to Airflow, so any change they need means modifying their Airflow code directly. If you want to modify how/what you upload to S3, you end up writing/modifying Python functions in the respective DAGs' code. This removes the flexibility to modify/substitute any component of the workflow with something else, or even to change the orchestrator from Airflow to something else. Additionally, different teams might write workflows in different ways, so standardizing practices is really hard. This in turn results in pouring more investment into hiring "Airflow data engineers" and maintaining what they build. Companies fall into steep tech debt.

Prefect/Dagster are the new orchestrators in town. I've yet to try them out, but I've heard mixed reviews about them.

EDIT: Forgot about upgrades. A lot of upgrades come with breaking changes, especially the recent move from 1.x to 2.x. You end up spending a lot of time just trying to debug what went wrong. Just installing and running it is a pain.

rockostrich 1485 days ago

We've established a rule that all "custom" code (anything that isn't a preexisting operator in airflow) needs to be contained in a docker image and run through the k8s pod operator. What's resulted is most folks do exactly what you said. They create a repo with a simple CLI that runs a script and the only thing that gets put in our airflow repo is the dependency graph/configuration for the k8s jobs.
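As a rough illustration of what "only the dependency graph/configuration" can look like, here is a plain-Python sketch. It deliberately does not use the Airflow API: the job names, image tag, and args are hypothetical, and in an Airflow repo each entry would presumably map to one pod-operator task instead.

```python
from graphlib import TopologicalSorter

# Hypothetical job specs: the image tag, args, and task names are
# illustrative only. All business logic lives inside the image's CLI.
JOBS = {
    "fetch": {
        "image": "example/pipeline:1.0",
        "args": ["fetch", "--out-dir", "/data/raw"],
        "upstream": [],
    },
    "upload": {
        "image": "example/pipeline:1.0",
        "args": ["upload", "--in-dir", "/data/raw"],
        "upstream": ["fetch"],
    },
    "ingest": {
        "image": "example/pipeline:1.0",
        "args": ["ingest", "--table", "raw_events"],
        "upstream": ["upload"],
    },
    "transform": {
        "image": "example/pipeline:1.0",
        "args": ["transform", "--table", "raw_events"],
        "upstream": ["ingest"],
    },
}


def execution_order(jobs: dict) -> list:
    """Return job names so that every job comes after its upstream jobs."""
    graph = {name: set(spec["upstream"]) for name, spec in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


def container_command(name: str, jobs: dict) -> list:
    """Image plus args: everything the orchestrator needs to know about a job."""
    spec = jobs[name]
    return [spec["image"], *spec["args"]]
```

Because the orchestrator side is reduced to data plus ordering, swapping the thing that consumes `JOBS` does not touch the scripts inside the image.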

claytonjy 1485 days ago

AFAICT this is the now-recommended way to use Airflow: as a k8s task orchestrator. Even the Astronomer team (major Airflow contributors; Airflow itself originated at Airbnb) will tell you to do it this way.

blakeburch 1485 days ago

Love your observation about tying the workflow to Airflow.

One of my biggest annoyances in the orchestration space is that teams are mixing business logic with platform logic, while still touting "lack of vendor lock-in" because it's open source. Once you're importing Airflow-specific operators into your script and changing the underlying code to make sure it works for the platform (XCom, task decorators, etc.), you are directly locking yourself in and making edits down the road even more difficult.

While some of the other players do a better job, their method of "code as workflow" still results in the same problems, where workflows get built as a "mega-script" instead of as modular components.

I'm a co-founder at Shipyard, a lightweight hosted orchestrator for data teams. One of our core principles is "Your code should run the same locally as it does on our platform". That means zero changes to your code.

You can define the workflow in a drag-and-drop editor or with YAML. Each task is its own independent script. At runtime, we automatically containerize each task and spin up ephemeral file storage for the workflow, letting you run scripts one after the other, each in its own virtual environment, while still sharing generated files as if you were running them on your local machine. In practice, that means that individual tasks can be updated (in app or through GitHub sync) without having to touch the entire workflow.
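For readers who want a feel for the "independent scripts sharing generated files" idea, here is a very rough local sketch. It is not Shipyard's implementation and does no containerization or per-task virtual environments; it only assumes that each task is a separate process and that a temporary directory stands in for the ephemeral storage. The two example tasks are made up.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_workflow(commands: list) -> dict:
    """Run each command as its own process inside one throwaway directory.

    Returns the files left in that directory (name -> text content) just
    before it is deleted, so the caller can inspect what the tasks produced.
    """
    with tempfile.TemporaryDirectory() as workspace:
        for command in commands:
            # check=True stops the workflow at the first failing task.
            subprocess.run(command, cwd=workspace, check=True)
        return {p.name: p.read_text() for p in sorted(Path(workspace).iterdir())}


# Two made-up tasks: the second one reads the file the first one wrote.
WRITE_TASK = [sys.executable, "-c", "open('data.txt', 'w').write('42')"]
DOUBLE_TASK = [
    sys.executable,
    "-c",
    "n = int(open('data.txt').read()); open('result.txt', 'w').write(str(n * 2))",
]
```

Calling `run_workflow([WRITE_TASK, DOUBLE_TASK])` runs both tasks in order; neither task knows about the runner, which is the property being described.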

I'm biased, but it seems crazy to me that so many engineers are willing to spend hours fighting the configuration of their orchestration platform rather than focusing on solving the problems at hand with code.