by arsenide
2371 days ago
We have many programs that must be run in sequence to transform a set of data from one form (on the order of tens to hundreds of GB) into a number of different output file formats. At some points the tools run linearly; at others, a few tools run in parallel branches and their outputs are then combined in some complicated way by the next tool. Some of these tools take up to days to run, so a misconfiguration due to human error (pretty common given the complexity of the work) often costs us days. The tools need to be configured in certain ways depending on business needs. A nice way to view the dataflow as a whole, configure these tools globally within some framework, and distribute the work across our internal server farm would be worth a good bit of money to the company.
|
I used Apache Airflow some years ago to do exactly this. It's pretty good. You build a workflow of tasks (in Python) and set a schedule for how often you want it run. It then runs those tasks across however many machines you run the Airflow worker on, orchestrating whatever it is you are trying to do.
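To make the "workflow of tasks" idea concrete, here is a rough sketch using only Python's standard library (`graphlib`, Python 3.9+). To be clear, this is not Airflow's API; the task names and the `PIPELINE` layout are made up for illustration. It only shows how a linear-then-branched-then-combined pipeline can be written down as a dependency graph and checked for a valid run order before anything expensive starts.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# "convert_a" and "convert_b" are the branched steps; "combine" merges them.
PIPELINE = {
    "extract": set(),
    "convert_a": {"extract"},
    "convert_b": {"extract"},
    "combine": {"convert_a", "convert_b"},
    "export": {"combine"},
}


def run_order(pipeline):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(pipeline).static_order())


def parallel_batches(pipeline):
    """Group tasks into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(pipeline)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for batch in parallel_batches(PIPELINE):
        print(batch)
```

Each batch holds tasks that could, in principle, be handed to different machines at the same time. A real orchestrator adds the scheduling, retries, and worker distribution on top of this.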
If a task fails it can notify you, and if you "miss" a run it can backfill it, provided your toolchain understands the concept of time. That is very useful for hourly/daily feeds: if you miss one, the system can go back and retry just the slots it missed.
Comes with a nice UI, too.
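The backfill idea is easy to picture with a small sketch. Again this is plain Python, not Airflow code, and `missed_slots` and its parameters are names invented here. The assumption is that every run is identified by the start time of its schedule slot, which is what "understands the concept of time" amounts to.

```python
from datetime import datetime, timedelta


def missed_slots(start, end, interval, completed):
    """Return slot start times in [start, end) that are not in `completed`."""
    slots = []
    current = start
    while current < end:
        if current not in completed:
            slots.append(current)
        current += interval
    return slots


if __name__ == "__main__":
    hour = timedelta(hours=1)
    start = datetime(2024, 1, 1, 0)
    end = datetime(2024, 1, 1, 4)
    # Hypothetical history: the 00:00 and 02:00 runs succeeded.
    done = {datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 2)}
    for slot in missed_slots(start, end, hour, done):
        print("needs backfill:", slot.isoformat())
```

Each task would then be rerun with the slot time as a parameter, so it processes only the data belonging to that window.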