by arsenide
2373 days ago
I’m glancing at the documentation and came across this line: “Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)” Many of our tools take on the order of 5-200 GB of data and do some transformation (the output, of similar size, gets passed along to the next tool, possibly after another automated validation step) and/or validation (whereby this particular branch of the workflow ceases). The automated modules we have are self-contained; each task in our case is “data + config parameters in, data out”, and we then use “data out” as “data in” after choosing configuration parameters for the next step. Would this still be a good use case, or am I misunderstanding what the above quote is about?
One example is perhaps a small Python script (run by Airflow) that pulls the files you need to process and passes them to a downstream task that runs a shell script; a further task then takes that script's output and does something else entirely.
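
To make that concrete, here is a rough sketch of the pattern in plain Python, with no Airflow imports so it runs anywhere `sort` is on the PATH. The function names (`pull_files`, `transform`, `validate`, `run_pipeline`), the file names, and the use of `sort` as the "shell script" are all made up for illustration. The point is only that each step leaves the bulk data on shared storage and hands the next step a path, which is the kind of small metadata the quoted docs line is talking about. In Airflow itself, each function would presumably become its own task and the returned path would travel between tasks via XCom.

```python
import subprocess
import tempfile
from pathlib import Path


def pull_files(workdir: Path) -> str:
    """Step 1 (hypothetical): fetch the input and return only its path."""
    raw = workdir / "raw.txt"
    raw.write_text("b\na\nc\n")  # tiny stand-in for a multi-GB download
    return str(raw)


def transform(in_path: str, workdir: Path) -> str:
    """Step 2 (hypothetical): run an external tool; return the output path."""
    out = workdir / "sorted.txt"
    # `sort` stands in for the self-contained shell script in the example.
    subprocess.run(["sort", "-o", str(out), in_path], check=True)
    return str(out)


def validate(in_path: str) -> bool:
    """Step 3 (hypothetical): check the output; False would stop this branch."""
    lines = Path(in_path).read_text().splitlines()
    return lines == sorted(lines)


def run_pipeline(workdir: Path) -> bool:
    # Only short path strings move between steps, never the data itself.
    raw_path = pull_files(workdir)
    out_path = transform(raw_path, workdir)
    return validate(out_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp)))
```

The working directory here is a temp dir; for the 5-200 GB case it would need to be storage that every worker can reach, since different tasks may run on different machines.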