by trumpeta 1495 days ago
We operate a (small?) Airflow instance with ~20 DAGs, but one of those DAGs has ~1k tasks. It runs on a k8s/AWS setup with MySQL backing it.

We package all the code in 1-2 different Docker images and then create the DAG. We've faced many issues (logs out of order, missing, random race conditions, random task failures, etc.)

But what annoys me the most is that for that one big DAG, the UI is completely useless: tree view has insane duplication, graph view is super slow and hard to navigate, and answering basic questions like what exactly failed and which nodes are around it is not easy.

4 comments

At Airbnb, we were using SubDAGs to try to manage a large number of tasks in a single DAG. This allowed organizing tasks and drilling down into failures more easily, but came with its own challenges.

In more recent versions of Airflow, TaskGroups (https://airflow.apache.org/docs/apache-airflow/stable/concep..., https://www.astronomer.io/guides/task-groups/ ) were added to help with this. Hopefully that helps a bit.

At ~1k nodes in the graph, introspection becomes hard anyway. As others have suggested, breaking it down if possible might be a good idea.

We had a similar DAG that was the result of migrating a single daily Luigi pipeline to Airflow. I started identifying isolated branches and breaking them off with external task sensors back to the main DAG. This worked but it's a pain in the ass. My coworker ended up exporting the graph to graphviz and started identifying clusters of related tasks that way.
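A rough sketch of the cluster-hunting idea above, in plain Python rather than graphviz itself: drop a few hub tasks, group what is left into connected components, and emit DOT text with one box per group. The task names, the `ignore` parameter, and both helper functions are made up for illustration; this is not necessarily how the coworker did it.

```python
def clusters(edges, ignore=()):
    """Group tasks into weakly connected components, skipping hub tasks in `ignore`."""
    neighbours = {}
    for upstream, downstream in edges:
        # Register every non-hub task, even if all of its edges touch a hub.
        for task_id in (upstream, downstream):
            if task_id not in ignore:
                neighbours.setdefault(task_id, set())
        if upstream in ignore or downstream in ignore:
            continue
        neighbours[upstream].add(downstream)
        neighbours[downstream].add(upstream)

    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first walk over undirected edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


def to_dot(edges, groups):
    """Render the edges as DOT, drawing each group as a boxed cluster subgraph."""
    lines = ["digraph dag {"]
    for index, group in enumerate(groups):
        lines.append(f"  subgraph cluster_{index} {{")
        lines.extend(f'    "{task_id}";' for task_id in group)
        lines.append("  }")
    lines.extend(f'  "{up}" -> "{down}";' for up, down in edges)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical dependency list: (upstream, downstream) pairs.
edges = [
    ("start", "extract_a"), ("extract_a", "load_a"),
    ("start", "extract_b"), ("extract_b", "load_b"),
    ("start", "report"),
]
groups = clusters(edges, ignore={"start"})
dot_text = to_dot(edges, groups)
```

Each group that comes out is a candidate for being split into its own DAG. The DOT text can be written to a file and rendered with the `dot` command to eyeball the result.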
I've not had the best luck with ExternalTaskSensors. There have been some odd errors like execution failing at 22:00:00 every day (despite the external task running fine).
Also, the @task annotation provides no facilities to name tasks. So if you like to build reusable tasks (as I do), you end up with my_generic_task__1, my_generic_task__2, my_generic_task__n. I've tried a few hacks to dynamically rename these, but I just ended up bringing down my entire staging cluster.
`your_task.override(task_id="your_generated_name")` not working for you?
I got pretty excited when I read this response, but no, it doesn't work. I'm not sure how this would work since annotated tasks return an xcom object.

Can you point me to the documentation on this function? It's possible I'm not using it correctly.

I can do something like this, which works locally, but breaks when deployed:

    res = annotated_task_function(...)
    res.operator.task_id = 'manually assigned task id'
    @task.python(task_id="this_is_my_task_name")
    def my_func():
        ...

This still has the problem that, when you call my_func multiple times in the same DAG, the resulting tasks will be labelled my_func, my_func__1, my_func__2, ...
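For what it's worth, here is a minimal sketch of how the `override` suggestion from earlier in the thread is usually combined with a naming helper, assuming an Airflow 2.x version where decorated tasks expose `.override()` (the reply above reports it did not work on their setup, so treat this as untested against theirs). `make_task_id`, `load`, the region names, and the DAG name are all hypothetical. The 250-character cap and the allowed character set reflect my understanding of Airflow's task_id rules.

```python
import re
from datetime import datetime


def make_task_id(base: str, label: str) -> str:
    """Build a readable, deterministic task_id such as 'load__eu_west_1'."""
    # Keep only alphanumerics, dashes, dots and underscores; collapse the rest.
    slug = re.sub(r"[^A-Za-z0-9_.-]+", "_", label).strip("_")
    return f"{base}__{slug}"[:250]


def build_dag():
    # Imported lazily so the helper above is usable without Airflow installed.
    from airflow.decorators import dag, task

    @task
    def load(region: str) -> str:
        return f"loaded {region}"

    @dag(start_date=datetime(2022, 1, 1), catchup=False)
    def named_tasks_example():
        for region in ["eu-west-1", "us-east-1"]:
            # override() returns a copy of the decorated task with a new
            # task_id; calling that copy is what creates the task in the DAG.
            load.override(task_id=make_task_id("load", region))(region)

    return named_tasks_example()
```

Note that `override` is called on the decorated function itself, before it is invoked, not on the XCom-like object that the call returns.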
How about the dynamic task mapping that is now available in 2.3?
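A minimal sketch of what that could look like, assuming Airflow 2.3 or later. `list_partitions`, the task names, and the DAG name are hypothetical stand-ins, not anything from the posts above.

```python
from datetime import datetime


def list_partitions() -> list:
    """Hypothetical stand-in for whatever produces the work items."""
    return ["2022-01-01", "2022-01-02", "2022-01-03"]


def build_dag():
    # Imported lazily so this module can be loaded without Airflow installed.
    from airflow.decorators import dag, task

    @task
    def get_partitions() -> list:
        return list_partitions()

    @task
    def process(partition: str) -> str:
        return f"processed {partition}"

    @dag(start_date=datetime(2022, 1, 1), catchup=False)
    def mapped_example():
        # expand() creates one mapped task instance per list element at run time.
        process.expand(partition=get_partitions())

    return mapped_example()
```

The mapped instances share a single task_id and are told apart by their map index, so the graph shows one node instead of one per item. That sidesteps the my_func__1, my_func__2 naming, but it only fits when the repeated tasks really are the same function over different inputs.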
Does this imply file metadata content can affect the access performance of those files even for operations that do not directly concern the metadata?