Hacker News
by jpau 2366 days ago
Heyo! Data guy here. Airflow and its DAG-managing peers are important for us.

Data transformations are one use. For us, they’re the most important one. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data.

We also use DAG-managing tools to call external APIs and get new data (e.g. for weather and geocoding), and to run batched ML training/inference pipelines too.

Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, then it is a lifesaver to be able to easily 1) run many threads of independent nodes; 2) re-run on failures; and 3) find nodes impacted by failure.
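
To make those three points concrete, here is a rough sketch in plain Python (standard library only, Python 3.9+). It is not Airflow's API, just an illustration of the idea. The function names (`run_dag`, `downstream`, `run_with_retries`) and the node names in the usage example are made up for this sketch.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def downstream(deps, failed):
    """Return every node that depends, directly or transitively, on `failed`."""
    children = {}
    for node, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, set()).add(node)
    impacted, stack = set(), [failed]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


def run_with_retries(task, retries):
    """Call `task` up to `retries + 1` times; return True on the first success."""
    for _ in range(retries + 1):
        try:
            task()
            return True
        except Exception:
            continue
    return False


def run_dag(deps, tasks, retries=1, workers=4):
    """Run tasks in dependency order; return (succeeded, failed, skipped) sets.

    `deps` maps each node to the set of nodes it depends on.
    `tasks` maps each node to a zero-argument callable.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    succeeded, failed, skipped = set(), set(), set()
    running = {}  # future -> node name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # 1) every node whose dependencies are done can run in parallel
            for node in sorter.get_ready():
                if node in skipped:
                    sorter.done(node)  # pass through without running
                else:
                    future = pool.submit(run_with_retries, tasks[node], retries)
                    running[future] = node
            if not running:
                continue
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                node = running.pop(future)
                if future.result():
                    succeeded.add(node)
                else:
                    # 2) retries are exhausted; 3) mark everything downstream
                    failed.add(node)
                    skipped |= downstream(deps, node)
                sorter.done(node)
    return succeeded, failed, skipped


if __name__ == "__main__":
    def always_fails():
        raise RuntimeError("geocoding API is down")

    # Hypothetical nightly pipeline: each node maps to the nodes it depends on.
    deps = {
        "clean": {"extract"},
        "geocode": {"extract"},
        "train": {"clean"},
        "report": {"clean", "geocode"},
    }
    tasks = {name: (lambda: None) for name in ("extract", "clean", "train", "report")}
    tasks["geocode"] = always_fails
    print(run_dag(deps, tasks))
```

In the usage example, `geocode` keeps failing, so `report` is skipped while `train` still runs, because it only depends on `clean`. Real schedulers add a lot on top of this (persistence, scheduling, backfills, a UI), but the dependency handling is the core of it.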

1 comment

Sorry, I most definitely didn't want to make light of the problem!

Pulling data from all the various teams' locally created data stores and external systems to push to analytics is definitely a large problem.

I was trying to figure out if these are aimed at data transformation pipelines, or state management systems - I've got state management problems, not data transformation problems.

Slightly different problems, but both fit with "Workflow".