by jpollock
2369 days ago
Why are all these tools DAGs? I'm guessing I misunderstand what is meant by a "workflow"? My assumption was that these tools manage a state machine, where "workflow" is a stand-in for "business process". If that's the job, I'd expect timers and loops. However, it seems these are aimed at data conversion pipelines?
Data transformation is one use case, and for us it's the most important one. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data.
We also use DAG-managing tools to call external APIs and pull in new data (e.g., for weather and geocoding), and for batched ML training/inference pipelines too.
Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, it is a lifesaver to be able to easily 1) run many independent nodes in parallel threads; 2) re-run nodes on failure; and 3) find the nodes impacted by a failure.
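To make those three points concrete, here is a minimal sketch in plain Python (standard library only, Python 3.9+ for `graphlib`). This is not Airflow's API or how Airflow is implemented; the task names in `DEPS` and the helpers `downstream` and `run_dag` are hypothetical, made up for illustration. It only shows why a DAG makes parallel runs, retries, and impact analysis straightforward.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
DEPS = {
    "extract_orders": set(),
    "fetch_weather": set(),
    "clean_orders": {"extract_orders"},
    "join_weather": {"clean_orders", "fetch_weather"},
    "train_model": {"join_weather"},
}


def downstream(deps, failed_task):
    """Point 3: return every task that transitively depends on failed_task."""
    children = {}
    for task, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, set()).add(task)
    impacted, stack = set(), [failed_task]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


def run_dag(deps, run_task, max_workers=4, retries=1):
    """Run tasks in dependency order; return (succeeded, failed, skipped)."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    succeeded, failed, skipped = set(), set(), set()

    def attempt(task):
        # Point 2: re-run a failing task up to `retries` extra times.
        for _ in range(retries + 1):
            try:
                run_task(task)
                return True
            except Exception:
                continue
        return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {}  # future -> task name
        while sorter.is_active():
            # Point 1: everything returned here has all of its
            # dependencies finished, so it can run concurrently.
            for task in sorter.get_ready():
                if task in skipped:
                    sorter.done(task)  # never run; just unblock the sorter
                else:
                    pending[pool.submit(attempt, task)] = task
            if not pending:
                continue
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in finished:
                task = pending.pop(future)
                if future.result():
                    succeeded.add(task)
                else:
                    failed.add(task)
                    skipped |= downstream(deps, task)
                sorter.done(task)
    return succeeded, failed, skipped
```

Calling `run_dag(DEPS, some_function)` with a function that raises for `"clean_orders"` would still run `extract_orders` and `fetch_weather` in parallel, report `clean_orders` as failed, and report `join_weather` and `train_model` as skipped. The same structure is why loops don't fit: a cycle has no valid execution order, so the sorter rejects it up front.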