Hacker News new | ask | show | jobs
by appplication 733 days ago
To your comment on Airflow, I’ve been around that block a few times. I’ve found Airflow (and really any orchestration) be the most manageable when it’s nearly devoid of all logic to the point of DAGs being little more than a series of function or API calls, with each of those responsible for managing state transfer to the next call (as opposed to relying on orchestration to do so).

For example, you need some ETL to happen every day. Instead of having your pipeline logic inside an airflow task, you put your logic in a library, where you can test and establish boundaries for this behavior in isolation, and compose this logic portably into any system that can accept your library code. When you need to orchestrate, you just call this function inside an airflow task.

This has a few benefits. You now decouple, to a significant extent, your logic and state transfer from your orchestration. That means if you want to debug your DAG, you don’t need to do it in Airflow. You can take the same series of function calls and run them, for example, sequentially in a notebook and you would achieve the same effect. This also can reveal just how little logic you really need in orchestration.

There are some other tricks to making this work really well, such as reducing dependency injection to primatives only where possible, and focusing on decoupling logic from configuration. Some of this is pretty standard, but I’ve seen teams not have a strong philosophy on this and then struggle with maintaining clean orchestration interfaces.

4 comments

Helpful comment! If I could pick your brain...

I'm looking at a green field implementation of a task system, for human tasks - people need to do a thing, and then mark that they've done it, and that "unlocks" subsequent human tasks, and near as I can tell the overall task flow is a DAG.

I'm currently considering how (if?) to allow for complex logic about things like which tasks are present in the overall DAG - things like skipping a node based on some criteria (which, it occurs to me in typing this up, can benefit from your above advice, as that can just be a configured function call that returns skip/no-skip) - and, well... thoughts? (:

I think there are some questions to ask that can help drive your system design here. Does each node in the DAG represent an event at which some complex automated logic would happen? If so, then I think the above would be recommended, since most of your logic isn’t the DAG itself, and the DAG is just the means of contextually triggering it.

However, if each node is more of a data check/wait (e.g. we’re on this step until you tell me you completed some task in the real world), then it would seem rather than your DAG orchestrating nodes of logic, the DAG itself is the logic. In this case, i think you have a few options, though Airflow itself is probably not something I would recommend for such a system.

In the case of the latter, there are a lot of specifics to consider in how it’s used. Is this a universal task list, where there is exactly one run of this DAG (e.g. tracking tasks at a company level), or would you have many independent runs of this (e.g. many users use it), are runs of it regularly scheduled (e.g. users run it daily, or as needed).

Without knowing a ton about your specifics, a pattern I might consider could be isolating your logic from your state, such that you have your logical DAG code, baked into a library of reusable components (a la the above), and then allowing those to accept configuration/state inputs that allow them to route logic appropriately. As a task is completed, update your database with the state as it relates to the world, not its place in the DAG. This will keep your state isolated from the logic of the DAG itself, which may or may not be desirable, depending on your objectives and design parameters.

Do you avoid things like task sensors? Based off what you described it sounds like an anti pattern if you’re using them.

Great description of good orchestration design. Airflow is fairly open ended in how you can construct dags, leading to some interesting results.

Yes, I think you could make an argument for them, but in general it means putting your state sensing into orchestration (local truth) rather than something external (universal truth). As with anything, it does depend on your application though. If you were running something like an ETL, I think it’s generally more appropriate to sense the output of that ETL (data artifact, table, partition, etc) than it is to sense the task itself. It does present some challenges for e.g. cascading backfills, but I think it’s a fine tradeoff in most applications.
If you’re already in the Kubernetes system, Argo Workflows has either capabilities designed around what you are describing or can be built using the templates supported (container, script, resource). If you’re not on Kubernetes, then Argo Workflows is not worth it on its own because it does demand expertise there to wield it effectively.

Someone suggested Temporal below and that’s a good suggestion too if you’re fine with a managed service.

Not GP or specifically Airflow user; but my approach is to have a fixed job graph, and unnecessary jobs immediately succeed. And indeed, jobs are external executables, with all the skip/no skip logic executed therein.

If nothing else, it makes it easy to understand what actually happened and when - just look at job logs.

I’m working on similar system. My plan is to have multiple terminal states for the tasks:

Closed - Passed

Closed - Failed

Closed - Waived

When you hit that Waived state, it should include a note explaining why it was waived. This could be “parent transaction dropped below threshold amount, so we don’t need this control” or “Executive X signed off on it”.

I’m not sure about the auto-skip thing you propose, just from a UX perspective. I don’t want my task list cluttered up with unnecessary things. Still, I am struggling with precisely where to store the business logic about which tasks are needed when. I’m leaning towards implementing that in a reporting layer. Validation would happen in the background and raise warnings, rather than hard stopping people.

The theory there is that the people doing the work generally know what’s needed better than the system does. Thus the system just provides gentle reminders about the typical case, which users can make the choice to suppress.

I think of jobs rather as of prerequisites. If a prerequisite is somehow automatically satisfied (dunno, only back up on Mondays, and today is Tuesday) then it succeeds immediately. There is no "skipping". Wfm.

I find embedding logic into DSLs usually quite painful and less portable than having a static job graph and all the logic firmly in my own code.

Tbh that sounds almost like an already built workflow engine like n8n or even Jira would be preferable to reinventing the wheel.
Have you looked into temporal.io? It supports dynamic workflows.
Ok, so question (because I really like the DAG approach in principle but don't have enough experience to have had my fingers burned yet):

The way you use Airflow, what advantage does it have over crontab? Or to put it another way, once you remove the pipeline logic, what's left?

Airflow provides straightforward parallelism and error handling of dependent subtasks. Cron really doesn’t.

With cron you have to be more thoughtful about failover especially when convincing others to write failure safe cron in invoked code. With airflow you shouldn’t be running code locally so you can have a mini framework for failure handling.

Cron doesn’t natively provide singleton locking so if the system bogs down you can end up running N of the same jobs at the same time which slows things down further. Airflow isn’t immune to this by default but it’s easier to setup centralized libraries that everything uses so more junior people avoid this when writing quick one off jobs.

Observability is a huge upside.
Backfilling is also very useful
Thanks to both comments.
This is exactly what we do, but with Spark instead. We develop the functions locally in a package and call necessary functions for the job notebooks, and the job notebooks are very minimalistic because of this
Spark-via-Airflow is also the context we use this, glad of see the pattern also works for you.
Thanks, this was really helpful.