|
To your comment on Airflow, I’ve been around that block a few times. I’ve found Airflow (and really any orchestration) to be most manageable when it’s nearly devoid of logic, to the point of DAGs being little more than a series of function or API calls, with each call responsible for managing state transfer to the next (as opposed to relying on the orchestrator to do so). For example, say you need some ETL to happen every day. Instead of putting your pipeline logic inside an Airflow task, you put it in a library, where you can test it and establish boundaries for its behavior in isolation, and compose it portably into any system that can accept your library code. When you need to orchestrate, you just call that function inside an Airflow task. This has a few benefits. You decouple, to a significant extent, your logic and state transfer from your orchestration. That means if you want to debug your DAG, you don’t need to do it in Airflow. You can take the same series of function calls and run them, for example, sequentially in a notebook and achieve the same effect. This can also reveal just how little logic you really need in orchestration. There are some other tricks to making this work really well, such as reducing dependency injection to primitives only where possible, and focusing on decoupling logic from configuration. Some of this is pretty standard, but I’ve seen teams lack a strong philosophy on this and then struggle to maintain clean orchestration interfaces. |
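A minimal sketch of the pattern described above, in plain Python with no Airflow import. Everything named here (`extract`, `transform`, `load`, `run_daily_etl`, the file naming, the stand-in data) is hypothetical and only meant to show the shape: each step takes primitives, writes its own output, and returns a path that the next step consumes, so no state lives in the orchestrator.

```python
import json
import tempfile
from pathlib import Path


def extract(run_date: str, out_dir: str) -> str:
    """Write raw records for run_date to disk and return the file path."""
    # Stand-in data; a real extract would call a source system here.
    records = [{"date": run_date, "value": v} for v in (1, 2, 3)]
    path = Path(out_dir) / f"raw_{run_date}.json"
    path.write_text(json.dumps(records))
    return str(path)


def transform(raw_path: str, out_dir: str) -> str:
    """Read the raw file, double each value, and return the cleaned file path."""
    records = json.loads(Path(raw_path).read_text())
    cleaned = [{**r, "value": r["value"] * 2} for r in records]
    name = Path(raw_path).name.replace("raw_", "clean_", 1)
    path = Path(out_dir) / name
    path.write_text(json.dumps(cleaned))
    return str(path)


def load(clean_path: str) -> int:
    """Stand-in for a warehouse load; returns the number of rows 'loaded'."""
    return len(json.loads(Path(clean_path).read_text()))


def run_daily_etl(run_date: str, out_dir: str) -> int:
    """The whole pipeline as plain function calls, passing paths between steps."""
    raw_path = extract(run_date, out_dir)
    clean_path = transform(raw_path, out_dir)
    return load(clean_path)


if __name__ == "__main__":
    # The same calls work in a notebook, a test, or inside an orchestrator task.
    with tempfile.TemporaryDirectory() as tmp:
        print(run_daily_etl("2024-01-01", tmp))
```

Under these assumptions, an orchestrator task would do nothing more than call `run_daily_etl` (or the three steps individually) with a date string and a directory, and debugging means running the same calls outside the orchestrator.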
I'm looking at a greenfield implementation of a task system for human tasks - people need to do a thing, and then mark that they've done it, and that "unlocks" subsequent human tasks, and as near as I can tell the overall task flow is a DAG.
I'm currently considering how (if?) to allow for complex logic about things like which tasks are present in the overall DAG - things like skipping a node based on some criteria (which, it occurs to me in typing this up, can benefit from your above advice, as that can just be a configured function call that returns skip/no-skip) - and, well... thoughts? (:
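One way the "configured function call that returns skip/no-skip" idea might look, as a rough sketch rather than a recommendation. All names here (`HumanTask`, `SKIP_RULES`, `skip_if_domestic`, the example tasks and the `country` context key) are made up for illustration. The assumption is that each task names a skip rule in its configuration, and that a skipped task counts as done only once its own prerequisites are satisfied, so skipping never unlocks downstream work early.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Mapping, Set, Tuple

# A skip rule is a plain function: context in, skip/no-skip out.
SkipRule = Callable[[Mapping[str, object]], bool]


def never_skip(context: Mapping[str, object]) -> bool:
    return False


def skip_if_domestic(context: Mapping[str, object]) -> bool:
    return context.get("country") == "US"


# Configuration refers to rules by name, keeping logic out of the DAG definition.
SKIP_RULES: Dict[str, SkipRule] = {
    "never": never_skip,
    "domestic": skip_if_domestic,
}


@dataclass(frozen=True)
class HumanTask:
    name: str
    depends_on: Tuple[str, ...] = ()
    skip_rule: str = "never"


def satisfied_tasks(
    tasks: Mapping[str, HumanTask],
    done: Iterable[str],
    context: Mapping[str, object],
) -> Set[str]:
    """Tasks that are done, plus skipped tasks whose prerequisites are satisfied."""
    satisfied = set(done)
    changed = True
    while changed:  # repeat until no more skipped tasks can be resolved
        changed = False
        for task in tasks.values():
            if task.name in satisfied:
                continue
            deps_ok = all(dep in satisfied for dep in task.depends_on)
            if deps_ok and SKIP_RULES[task.skip_rule](context):
                satisfied.add(task.name)
                changed = True
    return satisfied


def unlocked_tasks(
    tasks: Mapping[str, HumanTask],
    done: Iterable[str],
    context: Mapping[str, object],
) -> Set[str]:
    """Tasks a person can work on now: not yet satisfied, all prerequisites satisfied."""
    satisfied = satisfied_tasks(tasks, done, context)
    return {
        task.name
        for task in tasks.values()
        if task.name not in satisfied
        and all(dep in satisfied for dep in task.depends_on)
    }


TASKS = {
    t.name: t
    for t in (
        HumanTask("fill_form"),
        HumanTask("customs_check", depends_on=("fill_form",), skip_rule="domestic"),
        HumanTask("ship", depends_on=("customs_check",)),
    )
}
```

With these example tasks, `unlocked_tasks(TASKS, {"fill_form"}, {"country": "US"})` yields `ship` because the customs check is skipped, while the same call with `{"country": "DE"}` yields `customs_check`. The graph itself stays static; only the rule functions carry logic, and they can be tested in isolation.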