by jpollock 2369 days ago
Why are all these tools DAGs?

I'm guessing I misunderstand what is meant by a "Workflow"?

My assumption is that these are managing a state machine, where "workflow" is a stand-in for "business process"? If it's doing that sort of job, I'd expect timers and loops?

However, it seems these are aimed at data conversion pipelines?

4 comments

Heyo! Data guy here. Airflow and its DAG-managing peers are important for us.

Data transformations are one thing. For us, it’s the most important thing. Our data warehouse runs as a massive DAG of nightly batched transformations over app-generated data.

We also use DAG-managing tools to call external APIs for new data (e.g. weather and geocoding) and to run batched ML training/inference pipelines.

Why something like Airflow? Dependencies are easier to manage reliably. If you have hundreds or thousands of nodes in your DAG, then it is a lifesaver to be able to easily 1) run many threads of independent nodes; 2) re-run on failures; and 3) find nodes impacted by failure.
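To make points 1 and 3 concrete, here is a rough sketch using only Python's standard-library `graphlib`, not Airflow's own API. The task names and the `parallel_batches` / `impacted_by` helpers are made up for illustration; a real scheduler does far more than this.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
deps = {
    "extract_orders": set(),
    "extract_users": set(),
    "clean_orders": {"extract_orders"},
    "join_orders_users": {"clean_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}


def parallel_batches(deps):
    """Group tasks into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock its children
    return batches


def impacted_by(failed, deps):
    """Return every task that depends, directly or transitively, on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for task, parents in deps.items():
            if current in parents and task not in impacted:
                impacted.add(task)
                frontier.append(task)
    return impacted


if __name__ == "__main__":
    print(parallel_batches(deps))
    print(sorted(impacted_by("extract_users", deps)))
```

Each batch could be handed to a thread pool, and the result of `impacted_by` is the set of nodes to hold back or re-run once the failed node is fixed.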

Sorry, I most definitely didn't want to make light of the problem!

Pulling data from all the various teams' locally created data stores and external systems to push to analytics is definitely a large problem.

I was trying to figure out if these are aimed at data transformation pipelines, or state management systems - I've got state management problems, not data transformation problems.

Slightly different problems, but both fit with "Workflow".

> Why are all these tools DAGs?

I don't understand your question. Perhaps the answer is that workflows naturally require a data processing task to spawn a collection of child tasks when it finishes and, conversely, require a child task to start only after a collection of parent tasks has finished executing. This need to fork and join tasks ends up being modelled as a directed acyclic graph of processing tasks.

No. The tasks inside a workflow, concretely, would be things like Spark job execution, SQL query execution, download a CSV from the internet to HDFS and load it as a Hive table, etc. Think fancy cron that deals correctly with failures in multistage processes.
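A toy sketch of the "fancy cron that deals with failures" idea, in plain Python rather than any real scheduler's API. `run_pipeline`, the stage names, and the retry count are all hypothetical; the point is only that a failing stage is retried and, if it keeps failing, the later stages are skipped instead of running on bad input.

```python
def run_pipeline(stages, max_attempts=3):
    """Run stages in order; retry a failing stage, skip the rest if it keeps failing."""
    status = {}
    blocked = False
    for name, fn in stages:
        if blocked:
            status[name] = "skipped"  # an earlier stage never succeeded
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                fn()
                status[name] = f"ok (attempt {attempt})"
                break
            except Exception:
                status[name] = "failed"  # overwritten if a later attempt succeeds
        blocked = status[name] == "failed"
    return status


if __name__ == "__main__":
    attempts = {"count": 0}

    def download_csv():
        # Hypothetical stage that fails once, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 2:
            raise RuntimeError("transient network error")

    def load_hive_table():
        pass  # stand-in for the real load step

    print(run_pipeline([("download_csv", download_csv),
                        ("load_hive_table", load_hive_table)]))
```

A real tool would add backoff between attempts, persist the status so a re-run can resume from the failed stage, and trigger the whole thing on a schedule.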

The number of pipelines and executions is a function of the complexity of your application, and is independent of the number of records being processed by the batch jobs within those workflows.

It has a cron scheduler