by fdgsdfogijq 1588 days ago
This post is hard to follow. But I'll give my unsolicited opinion on Airflow:

It's too complex to run as a single team, and there are far better tools out there for scheduling. Airflow only makes sense when you need complex logic surrounding when to run jobs, how to backfill, when to backfill, and complex dependency trees. Otherwise, you are much better off with something like AWS Step Functions.

9 comments

Everyone's context is different, but I've found the exact opposite to be true. Airflow is simple and dumb enough that it can be easily understood and managed by a small team, but it's also flexible and powerful enough that we can't come up with a good enough reason to switch to anything else.*

*We are, however, becoming more and more reliant on dbt, and the article makes a good point about Airflow providing no visibility into what's going on inside a dbt node. So we're ending up with an increasingly simple Airflow DAG, with most of the complexity hidden inside a single dbt node.

This reflects how I often deploy Airflow as well (usually on GCP as Composer).

We use dbt to manage the DAG for the BQ transformations, put this in a container, and deploy it as a single node into the Kubernetes cluster that Airflow is running on.

Airflow can then handle the scheduling and the DAG nodes for non-DWH dependencies, such as loading/checking for files, kicking off tasks that need to run after the DWH refresh, and the like.
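
To make the shape of that setup concrete, here is a rough plain-Python sketch (not Airflow code) of the outer dependency graph, with the whole dbt project collapsed into one node. The task names are made up for illustration, and the ordering comes from the standard library's `graphlib`, not from Airflow's scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "check_for_files": set(),
    "load_files": {"check_for_files"},
    "dbt_run": {"load_files"},           # the entire dbt DAG is a single node here
    "post_refresh_export": {"dbt_run"},  # runs only after the DWH refresh
    "notify": {"dbt_run"},
}


def run_order(deps):
    """Return one valid execution order for the dependency mapping."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The orchestrator only sees five nodes; everything dbt does between `load_files` and the post-refresh tasks is opaque to it, which is the visibility trade-off mentioned above.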

I find once it is set up it is extremely easy for small teams to follow the pattern, and the single view of all the pipelines running is a great benefit - as well as handling the logic around last successful runs etc., that would need to be implemented manually if using simple cron jobs.

I'm not too familiar with the use of dbt, but what was the reason you chose to have a single dbt node rather than translating the dependencies into an Airflow DAG?
I understand it is subjective. But I use a forked version of https://github.com/puckel/docker-airflow on our managed K8s cluster, and it points to a cloud-managed Postgres. It has worked pretty well for over 3 years with no one actually managing it from an infra POV. YMMV. This is driving a product whose ARR is well into the hundreds of millions.

If you have simple needs that are more or less set, I agree Airflow is overkill and a simple Jenkins instance is all you need.

I run Airflow even for my local trading setup. For large teams, I often go with managed solutions like Astronomer.
Hi, I am the author of the post. Which parts did you find hard to follow?
> there are far better tools out there for scheduling

Really? Which ones? The only thing vaguely fitting this case is Jenkins, but using Jenkins to run ETL/ELT is a serious impedance mismatch.

Dagster/Prefect are the alternatives.

But yes, I'm confused. Triggering a DAG and having it exit based on complex logic is a perfectly normal pattern.

Interesting, I wouldn’t say that I’ve found it difficult to run in even a small team.

The problem I’ve always had with Airflow has been with non-cron-like use cases, for example data pipelines kicked off when some event occurs. Sensors were often an awkward fit, and the HTTP API was quite immature back when I was using it.

Agreed about sensors. We still have some trouble figuring them out and understanding why they sometimes don't trigger when they should.
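
For anyone who hasn't used them: conceptually, a sensor is a task that polls a condition at a fixed interval until it becomes true or a timeout expires. The sketch below is a plain-Python stand-in for that idea, not Airflow's API; `wait_for` and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_for(condition, poke_interval=0.01, timeout=0.1):
    """Poll condition() until it returns True or the timeout expires.

    Returns True if the condition was met, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)
```

An event that shows up and disappears between two polls is never observed, which is one plausible way a polling check can fail to trigger when you expect it to.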
I manage and run our Airflow instance. Outside of migrating from 1.x to 2.x, I haven't really had any problems. The learning curve was a bit higher than I hoped, but being able to set tasks downstream and backfill is so much nicer than running scripts with regular cron or Windows Task Scheduler.
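
As a rough idea of the bookkeeping you would otherwise write yourself with plain cron, here is a minimal sketch, assuming a daily schedule, of working out which runs need backfilling since the last successful one. `missing_runs` is a hypothetical helper, not something Airflow or cron provides.

```python
from datetime import date, timedelta


def missing_runs(last_success: date, today: date) -> list[date]:
    """Daily run dates after last_success, up to and including today."""
    days = (today - last_success).days
    # An empty range (days <= 0) means nothing needs to be backfilled.
    return [last_success + timedelta(days=n) for n in range(1, days + 1)]


# Example: the job last succeeded on 1 Jan and it is now 4 Jan.
to_backfill = missing_runs(date(2022, 1, 1), date(2022, 1, 4))
print(to_backfill)
```

With cron you would also have to persist `last_success` somewhere and rerun each missing date yourself.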
There really aren't many alternatives out there after cron. Maybe Lambda jobs count? What are you thinking of as alternatives?
Do you have recommendations for alternatives that are not tied to a cloud provider?
We are trying to build something like this at https://www.magniv.app/.

Would love to have you join our beta if you are interested!

Shipyard, Prefect, Dagster are all good options. Lots of newcomers in the orchestration space.
Jenkins