Hacker News
by the_af 1588 days ago
As a newcomer to the world of data, I have no strong opinions about Airflow. It replaced a bunch of disparate cron jobs, so it's definitely better than what was there before.

There are things I like and things I don't about it. The UI is awful -- I don't know anyone that likes it, unlike what the article states. I like that it's centralized and that it's all Python code.

Deploying it and fine-tuning the config for a variety of workloads can be a pain. Sometimes sensors don't work right. Tasks sometimes get evicted and killed for obscure reasons. Zombie tasks are enough of a pain that you'll see plenty of requests for help online.

That said, replacing it with a bunch of disparate tools again? Seems like a step backwards. And now instead of a single tool, your org has to vet, secure, understand and monitor a bunch of different tools? It's bad enough with only one...

What am I missing?

PS: data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory, instead of deprecating existing (and pretty recent) tech at an ever increasing pace.

3 comments

What you're missing is that for much of enterprise software before Airflow, everything was steaming rubbish.

Airflow is... not amazing. But by the standards of horrible enterprise software we've all been subjected to, it's not that bad.

If you're complaining about Airflow, wait for the day you're forced to use an internally built database client.

That's Afghanistan.

Our proprietary AWS wrapper takes 45 damn minutes on a good day to allocate a VM. The AMI is built in two minutes. TWO.

I'm sure in 5 years Dagster and Prefect will have improved gradually in lots of incremental ways. For now Airflow is pretty solid.

> If you're complaining about Airflow

Wait, maybe I explained myself badly: while I am complaining about some things I dislike about Airflow, at the same time I'm saying it's better than the random assortment of cron jobs we had before, and pushing back against the idea of "unbundling" it and going back to disparate tools by separate vendors.

I like writing Python code, I feel in control.

I have memories of pasting 10-line PowerShell scripts into one of those tiny Windows XP text entry boxes, and being happy I could do so!

Thanks for saying this. I have also been tasked with introducing Airflow at my company. I decided to use 2.0 so it's more Python DAGs. But for the most part the DAGs are JUST triggered via web service by other processes.

So... it's nothing more than processing plus a queue. I mean, we already have RabbitMQ and TypeScript. We also already have TypeScript + Agenda (over MongoDB).

We have gotten to the point where a single company is implementing queuing at least 4 different ways because "microservices".

> data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory, instead of deprecating existing (and pretty recent) tech at an ever increasing pace.

I disagree with you: data engineering as a field has been around for a very long time. Good practices exist and are good enough to accommodate new ones, like MLOps and data versioning.

However, for every great DE setup, you can find at least ten others that are a complete pile of shit, featuring mission-critical scripted SQL reports that no one understands anymore and closed-source orchestration products with multimillion-dollar support contracts that only one person has access to.

As always, tooling is rarely the issue. Data engineers are rarely working on the overall "big picture" and are often given tasks without context. Embedding data engineering within product and infrastructure teams is the solution to that issue.