by mickeyp
2367 days ago
This is a common problem, and one for which you'll find many different solutions. I used Apache Airflow some years ago to do exactly this. It's pretty good. You build a workflow of tasks (in Python) and set a schedule for how often you want it run. It then runs these tasks on any number of machines running the Airflow worker, orchestrating whatever it is you are trying to do. If a task fails, it can notify you; and if you "miss" a run, it can backfill it, provided your toolchain understands the concept of time. That is very useful for hourly/daily feeds: if you miss one, the system can go back and rerun just the slots it missed. Comes with a nice UI, too.
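To make the backfill idea concrete, here is a minimal sketch in plain Python. It is not Airflow's API; `missed_slots`, `run_feed`, and the sample dates are hypothetical names and values chosen for illustration. It only shows why each task needs to know which time slot it is running for.

```python
from datetime import datetime, timedelta


def missed_slots(start, end, interval, completed):
    """Return every schedule slot in [start, end) that has no completed run."""
    slots = []
    current = start
    while current < end:
        if current not in completed:
            slots.append(current)
        current += interval
    return slots


def run_feed(slot):
    # Hypothetical task: it receives the slot it is responsible for, so a
    # rerun processes the data for that slot rather than "whatever is newest".
    return f"processed feed for {slot:%Y-%m-%d %H:00}"


# Hourly schedule over six hours; the runs for 02:00 and 03:00 never happened.
start = datetime(2024, 1, 1, 0)
end = datetime(2024, 1, 1, 6)
completed = {datetime(2024, 1, 1, h) for h in (0, 1, 4, 5)}

backfill = missed_slots(start, end, timedelta(hours=1), completed)
results = [run_feed(slot) for slot in backfill]
```

The point of the design is that the task takes its slot as an argument. A tool that always reads "now" cannot be backfilled, which is what "understands the concept of time" means here.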
“Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)”
Many of our tools take on the order of 5-200 GB of data and perform a transformation (whose output, of similar size, is passed along to the next tool, possibly after another automated validation step) and/or a validation (at which point that particular branch of the workflow ends).
The automated modules we have are self-contained; each task in our case is “data + config parameters in, data out”. The “data out” then becomes the “data in” for the next step, once configuration parameters for that step have been chosen.
Would this still be a good use case, or am I misunderstanding what the above quote is about?
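For reference, the “data + config in, data out” shape described above can be sketched so that steps exchange only file paths (metadata) while the data itself stays on disk. This is plain Python, not Airflow code. `transform`, `validate`, `run_pipeline`, and the config keys are hypothetical, and the sketch assumes a failed validation is what ends a branch.

```python
import tempfile
from pathlib import Path
from typing import Optional


def transform(data_in: Path, config: dict, workdir: Path) -> Path:
    """Stream data_in to a new file and return the path of the output."""
    data_out = workdir / "transformed.txt"
    with data_in.open() as src, data_out.open("w") as dst:
        for line in src:  # line by line, so large files are never fully loaded
            dst.write(line.upper() if config.get("uppercase") else line)
    return data_out


def validate(data_in: Path, config: dict) -> bool:
    """Return True when the file has at least config['min_lines'] lines."""
    with data_in.open() as src:
        return sum(1 for _ in src) >= config["min_lines"]


def run_pipeline(raw: Path, workdir: Path, min_lines: int) -> Optional[Path]:
    # Only paths cross step boundaries; each step reads and writes storage.
    transformed = transform(raw, {"uppercase": True}, workdir)
    if not validate(transformed, {"min_lines": min_lines}):
        return None  # assumed behaviour: this branch of the workflow stops
    return transformed


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    raw = workdir / "raw.txt"
    raw.write_text("a\nb\nc\n")

    accepted = run_pipeline(raw, workdir, min_lines=2)
    output_text = accepted.read_text() if accepted else None
    rejected = run_pipeline(raw, workdir, min_lines=10)
```

In this shape the orchestrator only needs to know where each output lives and which step consumes it next; the bytes never pass through the scheduler.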