by caravel 3394 days ago
[Airflow author here] In general, and in terms of best practices, we like to think of a DAG's shape as slowly changing, in the same way that a database's table definitions are slowly changing. You generally don't want to change your tables' structure dynamically. This constraint brings a certain clarity and maintainability, and most use cases can be expressed this way.

Now, Airflow does allow you to do what you're describing, and I'll explain how. If you were my coworker I'd dig deeper and try to understand whether the design you want is really the best one, but let's assume it is. First, we support "externally triggered DAGs": workflows that don't run on a schedule but run when they are triggered, either by some sensor or externally in some other way. A use case would be a company processing genome files: every time a new genome file shows up, we want to run a static DAG for it. https://airflow.incubator.apache.org/scheduler.html#external...
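
To make the "static DAG, triggered per file" idea concrete, here is a minimal plain-Python sketch. It does not use Airflow's API. The task names, the file paths, and the `on_new_files` hook are all hypothetical stand-ins for whatever sensor or external trigger starts a run.

```python
# Fixed task order: the "shape" of the workflow never changes between runs.
# (Hypothetical task names, not taken from Airflow or the comment above.)
STATIC_TASKS = ("validate", "align", "summarize")


def trigger_run(genome_path):
    """Start one run of the static workflow for a newly arrived file."""
    # Only the data (the file path) varies; the task list stays the same.
    return [f"{task}:{genome_path}" for task in STATIC_TASKS]


def on_new_files(paths):
    """Hypothetical hook called by an external trigger with new file paths."""
    return {path: trigger_run(path) for path in paths}


runs = on_new_files(["genomes/a.fasta", "genomes/b.fasta"])
```

Each arriving file gets its own run, but every run walks the same three tasks in the same order.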

We also support branching, meaning you can take different paths down the DAG based on what happened upstream. https://airflow.incubator.apache.org/concepts.html#branching
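
As a rough illustration of branching, the plain-Python sketch below picks one of several pre-existing paths based on an upstream result. It is a model of the idea, not Airflow code: the path names, the `row_count` and `quality` fields, and the 0.8 threshold are assumptions made up for the example. In Airflow itself, `BranchPythonOperator` plays a similar role by taking a callable that returns the `task_id` to follow.

```python
def choose_branch(upstream_result):
    """Return the name of the downstream path to follow."""
    if upstream_result["row_count"] == 0:
        return "skip_processing"
    if upstream_result["quality"] < 0.8:  # hypothetical quality threshold
        return "quarantine"
    return "process"


# The set of paths is fixed up front; only the choice among them varies.
PATHS = {
    "skip_processing": lambda result: "nothing to do",
    "quarantine": lambda result: f"quarantined {result['row_count']} rows",
    "process": lambda result: f"processed {result['row_count']} rows",
}


def run_downstream(upstream_result):
    """Pick a branch from the upstream result and run only that path."""
    branch = choose_branch(upstream_result)
    return branch, PATHS[branch](upstream_result)
```

The important property is that every path already exists in the workflow definition; a run selects among them rather than creating new ones.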

Now if your DAG's shape changes dramatically at every run [a shapeshifting DAG!], I would argue that conceptually these are different DAGs, and I would suggest building "singleton" DAGs dynamically. That means you have Python code that creates a DAG object [with its own dag_id] for each "instance", with schedule_interval='@once' so that each DAG runs only once. You can shape each DAG individually from that same script and craft whatever dependencies you like for each one.
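
The singleton pattern can be sketched in plain Python as follows. This is a model under stated assumptions, not Airflow code: `DagSpec` is a hypothetical stand-in for Airflow's DAG object, and the instance names and task names are invented. Only `dag_id` and `schedule_interval='@once'` come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class DagSpec:
    """Hypothetical stand-in for an Airflow DAG object."""
    dag_id: str
    schedule_interval: str
    # Maps each task name to the list of task names it depends on.
    dependencies: dict = field(default_factory=dict)


def build_singleton_dags(instances):
    """Create one run-once DagSpec per instance, each with its own shape."""
    dags = {}
    for name, dependencies in instances.items():
        dag_id = f"singleton_{name}"  # unique dag_id per "instance"
        dags[dag_id] = DagSpec(dag_id, "@once", dict(dependencies))
    return dags


# Hypothetical instances: each one has a different dependency structure.
INSTANCES = {
    "sample_a": {"load": [], "align": ["load"], "report": ["align"]},
    "sample_b": {"load": [], "report": ["load"]},
}

dags = build_singleton_dags(INSTANCES)
```

In a real Airflow DAG file, the loop would construct actual DAG objects and expose each one at module level so the scheduler can discover it.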

Though all of this is possible and easy-ish to do, it may not be the best approach. Try to think of your DAGs and tables as static [or slowly changing] if you can, and of the data as the variable.

As an analogy, imagine an oil pipeline that changes shape based on the quality of the oil it processes. Crazy?! It's easier to think of the pipeline as static infrastructure, with components that can sort and direct the flow through [existing and static] pipes.