by tpaschalis 2852 days ago
There's a bunch of different tools to do the same job, from manual cron jobs to Luigi, Pinball, Azkaban, Oozie, Taverna, and Mistral.

I've started to use Airflow for personal projects, and I'm slowly probing for adoption in our shop, where applicable.

The good points I have seen

- It's simple Python, and not XML like Azkaban. I've seen people with less technical expertise build useful stuff quickly, and automate their workflows.

- Very good UI, which just lets you do what you need without fuss.

- Easy to build modular and interactive flows, with interesting stuff such as sensors, communication between operators, triggers, etc.

- Everything is stored in a database, which I can query for anything related to the processes run and to Airflow itself

- Its source is grok-able and documented, and it lets you easily add your own modules (or "operators" as they're called)

- Many add-on modules for operators already exist from the community

- Easier to force the team to version control their process flows

Some cons, from the light use I've seen

- If you scale beyond a point, you have to take care of scaling the database as well, adding DBA work

- I've encountered some issues with the scheduler and backfilled jobs, and with `depends_on_past`, but it might be my limited experience

- People may start to use specific external dependencies/modules, which you will then need to keep track of

- Uses its own lingo/terminology, which you'll have to learn and use

- Uses system time, so no running in different timezones
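For anyone who hasn't hit the `depends_on_past` point yet, here's a rough pure-Python sketch of the behaviour as I understand it: with the flag set, a run of a task is held back unless the previous scheduled run of that same task succeeded. This is not Airflow code. The `Run` class, the `simulate` function, the `"blocked"` label and the dates are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Run:
    """One scheduled interval of a single task (hypothetical stand-in, not an Airflow class)."""
    date: str
    succeeds: bool  # whether the task's work would succeed if it were allowed to run


def simulate(runs: List[Run], depends_on_past: bool) -> Dict[str, str]:
    """Walk the runs in schedule order and return the final state of each one."""
    states: Dict[str, str] = {}
    previous_state: Optional[str] = None  # None means this is the first run
    for run in runs:
        # Assumption: the first run is always allowed; later runs need a
        # successful predecessor when depends_on_past is on.
        blocked = (
            depends_on_past
            and previous_state is not None
            and previous_state != "success"
        )
        if blocked:
            state = "blocked"  # my own label for "never got to run"
        else:
            state = "success" if run.succeeds else "failed"
        states[run.date] = state
        previous_state = state
    return states


# A four-day backfill where the second day fails (made-up data).
backfill = [
    Run("2018-01-01", succeeds=True),
    Run("2018-01-02", succeeds=False),
    Run("2018-01-03", succeeds=True),
    Run("2018-01-04", succeeds=True),
]

with_flag = simulate(backfill, depends_on_past=True)
without_flag = simulate(backfill, depends_on_past=False)

for date in with_flag:
    print(date, with_flag[date], without_flag[date])
```

In this toy model one failed day stalls every later day of the backfill until it is fixed, which is roughly the kind of surprise I mean above.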

I have high hopes for the project, as it's currently incubating at the Apache Software Foundation, and I hope it remains minimal and keeps the present scope.

If it seems interesting to you, my suggestion is to start small, keep in mind that it handles relations between tasks and not data, and try to automate some easy bash script that you currently handle with cron.
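To make "relations between tasks and not data" a bit more concrete, here's a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+), not Airflow itself. The task names and the `make_task` and `run_all` helpers are hypothetical. The point is that the graph only says which task waits for which, and nothing is passed from one task to the next.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

executed: List[str] = []  # records the order in which tasks actually ran


def make_task(name: str) -> Callable[[], None]:
    """Build a stand-in task; a real one might shell out to a bash script."""
    def task() -> None:
        executed.append(name)
    return task


# Hypothetical task names, chosen only for illustration.
tasks: Dict[str, Callable[[], None]] = {
    name: make_task(name) for name in ("extract", "clean", "validate", "load")
}

# Each task maps to the set of tasks that must finish before it starts.
upstream: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}


def run_all() -> None:
    """Run every task once, in an order that respects the dependencies."""
    for name in TopologicalSorter(upstream).static_order():
        tasks[name]()


run_all()
print(executed)
```

`clean` and `validate` can come out in either order since neither depends on the other, and a cyclic graph makes `static_order()` raise `graphlib.CycleError`. A real scheduler adds retries, timing and state on top, but the dependency part is about this small.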

1 comment

> There's a bunch of different tools to do the same job

Yup. Almost too many tools, in fact.

https://s.apache.org/existing-workflow-systems

As a somewhat related addendum, some of the worst pipelines I've encountered have been in scientific computing. Conceptually, DAGs are quite simple, but for some reason things always end up gummed up in implementation, which is partly why scientific results are much harder to reproduce than they should be. The gap between how hard pipeline creation is in the sciences and how easy it should be has always confused me a bit.