Hacker News new | ask | show | jobs
by caravel 3396 days ago
[Airflow author] The task is centric to the workflow engine. It's an important entity, and it's complementary to the data lineage graph (not necessarily a DAG btw).

At Airbnb we have another important tool (not open source at the moment) that is a UI and search engine to understand all of the "data objects" and how they relate. It includes datasets, tables, charts, dashboards and tasks. The edges are usage, attribution, sources, ... This tool shows [amongst other things] data lineage and is complementary to Airflow.

1 comments

Does the other tool you're talking about that works with Airflow allow you to scale your Airflow tasks based on, for example, the amount of new input data that needs to be processed? That was one of the major challenges we see in bioinformatics workloads. Sometimes you have a few new samples to run and other times there are thousands -- so your task scheduler, although it is centric, needs to have an understanding of the data too.
At a high level [for Airflow specifically] the scheduler or workflow engine cares most about the tasks and their dependencies, and is somewhat agnostic about the units of work it triggers.

It's possible to use a feature called XCom as a message bus between tasks, but would typically direct people in the direction of having stateless, idempotent, "contained" units of work and avoid cross task communication as much as possible. https://airflow.incubator.apache.org/concepts.html#xcoms

For your case [which I have little input on] I think singleton DAGs described in another post on this page may work.