by jpau
2366 days ago
Heyo! Data guy here. Airflow and its DAG-managing peers are important for us. Data transformation is only one use case for these tools, but for us it's the most important one. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data. We also use DAG-managing tools to call external APIs for new data (e.g. weather and geocoding), and to run batched ML training/inference pipelines. Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, it is a lifesaver to be able to easily 1) run many threads of independent nodes; 2) re-run on failures; and 3) find nodes impacted by a failure.
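To make points 1 and 3 concrete, here is a minimal sketch in plain Python. It is not how Airflow does it internally. It uses the standard library's graphlib (Python 3.9+), and the node names and the two helper functions are hypothetical, made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each node maps to the set of nodes it depends on.
deps = {
    "raw_events": set(),
    "weather_api": set(),
    "clean_events": {"raw_events"},
    "daily_summary": {"clean_events", "weather_api"},
    "ml_features": {"daily_summary"},
}

def independent_batches(deps):
    """Group nodes into batches; nodes within a batch do not depend on
    each other, so they could be run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches

def impacted_by(failed, deps):
    """Return every node that transitively depends on `failed`."""
    # Invert the dependency map: parent -> set of direct children.
    children = {}
    for node, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, set()).add(node)
    impacted, stack = set(), [failed]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(independent_batches(deps))
print(impacted_by("clean_events", deps))
```

With only five nodes this is trivial to do by eye. The point is that the same two questions (what can run now, and what is downstream of a failure) stay cheap to answer when the graph has thousands of nodes, as long as the dependencies are declared explicitly.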
Pulling data from all the various teams' locally created data stores and external systems to push to analytics is definitely a large problem.
I was trying to figure out if these are aimed at data transformation pipelines, or state management systems - I've got state management problems, not data transformation problems.
Slightly different problems, but both fit with "Workflow".