Hacker News
by atomicity 2132 days ago
Nothing is more popular yet, but there are better-architected options out there. Airflow has hit 17k GitHub stars and was used by my previous team. I don't think anything will beat it unless something from the CI/CD or "cloud native" world moves in unexpectedly.

The operators and scalability are somewhat useful. I was happy with the UI compared to cron. Testing is a mess. Also, Airflow isn't CI/CD-friendly (but it's possible to get it to work).

I'd recommend a managed option unless you have a skilled ops team. It reminds me of Hadoop in terms of how exciting it is to get set up, which isn't a good thing.

2 comments

I can confirm all of this. I was involved with setting up Airflow recently and we had a rather rough time because it is kind of a half-assed solution. It's basically a framework that lets you do stuff through a lot of plugins/connectors that may or may not be useful for you, and they vary widely in completeness, bugginess, documentation, and utility. A lot of it is kind of sketchy or even actively harmful, but there are definitely some useful things as well.

It does not help that the entirety of the documentation is written from the point of view of people who are definitely not of the devops variety, doing things manually on their laptop, i.e. all the wrong things you should never do in a production setup. Configuring this thing for production usage is largely undocumented and nontrivial. You'll be piecing things together from Stack Overflow and various third-party GitHub repositories (e.g. for using Docker, Terraform, etc.) rather than from the official documentation, which merely hints at these things being possibilities.

It also does not help that the internals are kind of buggy and wonky. We had a really hard time getting the basic plumbing for running workers, queues, etc. working properly. It would constantly grind to a halt and stop processing stuff. Also, there's this minutes-long uncertainty principle: "is it actually running or still figuring out that it needs to catch up?!".

Also, the UI/UX is terrible IMHO. Think hitting cmd+r a lot, because automatic page refreshes are not a thing in Airflow and absolutely everything requires multiple clicks to navigate complex dialogs (modal, naturally). So, unless you just manually reloaded the page, you are looking at stale information: jobs that have long finished, green statuses that have turned red, etc. Even Jenkins/Hudson had auto reload 15 years ago. And given the significant overlap in functionality, you might actually be better off using that if all you need is the ability to run some simple job at specific intervals.

The only valid reason for using Airflow is the ecosystem of plugins. It's valid, and it's basically the same reason that people tolerated the craptastic experience that was managing Nagios back in the day. Horribly complicated to set up, terrible/primitive UI, loads of performance issues, nontrivial failure modes, etc., but world + dog used it and there were Nagios plugins for just about everything. I've been down that rabbit hole as well and I'd say the experience is similar enough.

So, definitely use it in hosted form if you can, or avoid it altogether unless you really need it.

Can you expand on "testing is a mess"? Do you mean testing your own DAGs and operators?
Yeah, like the other reply, I'd mostly say testing DAGs was an issue. Airflow-related configuration is easy to get wrong and it silently fails a lot.

Now that I think about it though, most of the time I spent on testing wasn't caused by Airflow. Testing data pipelines just isn't easy with the current well-known tooling.
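
To make the "silently fails" point concrete, here is a rough sketch of the kind of fail-loud check that can run in CI before anything reaches the scheduler. It is plain Python (3.9+ for `graphlib`) and does not use Airflow's API at all; `validate_dependencies` and the task names are made up for illustration. It only assumes you can describe task dependencies as a dict.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dependencies(upstream):
    """Return a valid run order, or raise ValueError with a readable message.

    `upstream` maps each task name to the set of task names it depends on.
    (Hypothetical helper, not part of Airflow.)
    """
    known = set(upstream)
    for task, deps in upstream.items():
        # A typo in a dependency name should be an error, not a silent no-op.
        missing = set(deps) - known
        if missing:
            raise ValueError(f"{task} depends on undefined task(s): {sorted(missing)}")
    try:
        # static_order() yields tasks so that dependencies come first.
        return list(TopologicalSorter(upstream).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

Calling it with something like `{"extract": set(), "transform": {"extract"}, "load": {"transform"}}` gives back a run order, while a misspelled dependency or a cycle raises immediately instead of showing up later as a task that never runs.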

Probably DAGs. Operators can be tested from their hooks, but in my experience testing a DAG is annoying. I usually just make a copy that does a dry run or runs with test data, or just test in a local Airflow container, as it's much faster.
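
A minimal sketch of the dry-run/test-data idea, assuming the task logic lives in plain functions that the operator merely calls. `clean_rows`, `run_pipeline`, and the `dry_run` flag are hypothetical names for illustration, not anything Airflow provides.

```python
def clean_rows(rows):
    """Drop rows without an id and normalise the email field."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # skip incomplete records rather than failing the whole run
        email = (row.get("email") or "").strip().lower()
        cleaned.append({"id": row["id"], "email": email})
    return cleaned


def run_pipeline(rows, sink, dry_run=False):
    """Entry point a task would call; with dry_run=True nothing is written."""
    cleaned = clean_rows(rows)
    if not dry_run:
        sink.extend(cleaned)  # stand-in for the real write (DB, bucket, ...)
    return len(cleaned)


if __name__ == "__main__":
    test_rows = [{"id": 1, "email": " A@B.com "}, {"id": None, "email": "x"}]
    sink = []
    print(run_pipeline(test_rows, sink, dry_run=True), sink)
```

Because the logic never touches the scheduler, it can be exercised with a handful of test rows in an ordinary unit test, and the DAG file itself shrinks to wiring.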