by concam 3396 days ago
I think they are referring to "event-driven" DAGs, which would be both shaped and triggered dynamically. You can accomplish this now, but it feels a bit hacky and it is pretty clear that it goes against the Airflow paradigm of static, slowly-changing workflows.
3 comments

[Airflow author here] In general, and when thinking in terms of best practices, we like to think of a DAG's shape as slowly changing, in the same way that a database's table definitions are slowly changing. In general, you don't want to change your tables' structure dynamically. This constraint brings a certain clarity and maintainability, and most use cases can be expressed this way.

Now, Airflow allows you to do what you're describing as well, and I will explain how. If you were my coworker I'd dig deeper and try to understand whether the design you want is the design that is best, but let's assume it is. So first, we support "externally triggered DAGs", which means those workflows don't run on a schedule; they run when they are triggered, either by some sensor or externally in some way. A use case for that would be a company processing genome files: every time a new genome file shows up, we want to run a static DAG for it. https://airflow.incubator.apache.org/scheduler.html#external...

We also support branching, meaning you can take different paths down the DAG based on what happened upstream. https://airflow.incubator.apache.org/concepts.html#branching
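To make the branching idea concrete, here is a minimal sketch of the kind of callable a branch operator could use: it inspects something learned upstream and returns the task_id of the path to follow. The task ids `process_large` and `process_small`, the `row_count` input, and the threshold are all hypothetical names chosen for illustration; how the upstream value reaches the callable is left out and depends on your setup.

```python
def choose_path(row_count, threshold=1000):
    """Return the task_id of the downstream path to follow.

    Both paths already exist in the static DAG; this function only
    picks one of them based on what happened upstream.
    """
    if row_count >= threshold:
        return "process_large"   # hypothetical task_id
    return "process_small"       # hypothetical task_id


# In an Airflow DAG file, a callable like this would be handed to a
# branch operator (BranchPythonOperator) as its python_callable, with
# tasks named "process_large" and "process_small" set downstream of it.
```

The DAG's shape stays fixed here; only the route taken through it varies per run.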

Now if your DAG's shape changes dramatically at every run [a shapeshifting DAG!], I would argue that conceptually they are different DAGs, and I would suggest building "singleton" DAGs dynamically. That means you have Python code that creates a DAG object [with its own dag_id] for each "instance", with schedule_interval='@once' so that each DAG runs only once. You can shape each DAG individually, from that same script, and craft whatever dependencies you like for each one.
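A rough sketch of that singleton pattern, under some loud assumptions: the `INSTANCES` mapping, the `singleton_` prefix, the step names, and the start date are all made up for illustration, each instance is just a linear chain of placeholder tasks, and the import paths are the Airflow 1.x-era ones (they moved in later releases). The Airflow imports are done lazily so the planning helper can be read and exercised on its own.

```python
from datetime import datetime

# Hypothetical per-instance specs: each entry is one differently shaped workflow.
INSTANCES = {
    "sample_a": ["extract", "align", "report"],
    "sample_b": ["extract", "report"],
}


def plan_singleton_dag(name, steps):
    """Return (dag_id, edges) for one instance; edges chain the steps in order."""
    dag_id = "singleton_{}".format(name)
    edges = list(zip(steps, steps[1:]))
    return dag_id, edges


def build_singleton_dag(name, steps):
    """Build one run-once DAG. Requires Airflow; assumes 1.x-era import paths."""
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag_id, edges = plan_singleton_dag(name, steps)
    dag = DAG(
        dag_id=dag_id,
        schedule_interval="@once",        # each singleton DAG runs only once
        start_date=datetime(2017, 1, 1),  # arbitrary date, for illustration
    )
    # Placeholder tasks; real operators would go here.
    tasks = {step: DummyOperator(task_id=step, dag=dag) for step in steps}
    for upstream, downstream in edges:
        tasks[upstream].set_downstream(tasks[downstream])
    return dag


def register_all(namespace):
    """Expose every singleton DAG as a module-level name so Airflow can find it."""
    for name, steps in INSTANCES.items():
        dag = build_singleton_dag(name, steps)
        namespace[dag.dag_id] = dag
```

In a DAG file you would call `register_all(globals())` at module level, since Airflow discovers DAG objects it finds in a module's global namespace.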

Though all of this is possible and easy-ish to do, it may not be the best approach. Try to think of your DAGs and tables as static [or slowly changing] if you can, and the data as the variable.

As an analogy, try to think of an oil pipeline that changes shape based on the quality of the oil it processes. Crazy?! It's easier to think of the pipeline as static and infrastructure, and to have components that can sort and direct the flow in [existing and static] pipes.

Starting from 1.8, you will be able to trigger DAGs through a REST API that is fully supported.

Shaping DAGs dynamically poses a challenge to the scheduler on how to 'predict' what tasks need to run in the future. The scheduler needs to evaluate which tasks will need to run, without actually executing these tasks themselves. For Airflow in its current state that is a chicken and egg problem.

For the future, I can imagine allowing dynamic DAGs to be described through the REST API, but that is definitely further out and has not really appeared on the horizon yet.

Yes, exactly my point.