Hacker News
by arsenide 2373 days ago
I’m glancing at the documentation, and came across this line:

“Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)”

Many of our tools take on the order of 5-200 GB of data and perform a transformation (whose output, of similar size, gets passed along to the next tool, possibly after another automated validation step), a validation (at which point that particular branch of the workflow ends), or both.

Our automated modules are self-contained: each task is “data + config parameters in, data out”, and that “data out” becomes the “data in” for the next step once its configuration parameters are chosen.

Would this still be a good use case, or am I misunderstanding what the above quote is about?

1 comment

Airflow does not do the work itself. Since you write tasks in Python, you _could_ make it do the processing, but that would be the wrong approach for large volumes of data when time is of the essence. It merely calls out to whatever does the work, such as other tools that do the processing.

One example might be a small Python script (run by Airflow) that pulls the files you need to process and passes them to a downstream task that runs a shell script, whose output in turn feeds a step that does something else entirely.
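
To make that pattern concrete, here is a rough sketch in plain Python. It deliberately does not use the Airflow API; it only stands in for the idea that the orchestrating code hands file paths from one step to the next while an external process does the heavy lifting. The function names (`pull_files`, `transform`, `run_pipeline`), the sample file, and the uppercase "tool" are all hypothetical placeholders, and in a real setup each function would roughly correspond to one task.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def pull_files(workdir: Path) -> List[str]:
    """Hypothetical first step: fetch the inputs and return only their paths."""
    src = workdir / "input.txt"
    src.write_text("hello\nworld\n")  # placeholder for a real download
    return [str(src)]


def transform(in_path: str, workdir: Path) -> str:
    """Hypothetical second step: an external process does the actual work."""
    out_path = workdir / (Path(in_path).stem + ".upper.txt")
    # Stand-in for a shell script or other tool: a child process that
    # reads the input file and writes an uppercased copy.
    tool = (
        "import sys, pathlib; "
        "pathlib.Path(sys.argv[2]).write_text("
        "pathlib.Path(sys.argv[1]).read_text().upper())"
    )
    subprocess.run([sys.executable, "-c", tool, in_path, str(out_path)], check=True)
    return str(out_path)


def run_pipeline(workdir: Path) -> List[str]:
    """The orchestrator only passes paths (metadata) between steps, never the data."""
    return [transform(p, workdir) for p in pull_files(workdir)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_pipeline(Path(d)))
```

The data itself only ever lives on disk (or in whatever storage the tools share), so the size of the files does not matter to the orchestrating process.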