by mickeyp 2367 days ago
This is a common problem, and one for which you'll find many different solutions.

I used Apache Airflow some years ago to do exactly this. It's pretty good. You build a workflow of tasks (in Python) and set a schedule for how often you want it to run. Airflow then orchestrates the work by running these tasks on any number of machines on which you run the Airflow worker.

If a task fails, it can notify you, and if you "miss" a run it can backfill it, provided your toolchain understands the concept of time. This is very useful for hourly/daily feeds: if you miss one, the system can go back and retry just the slots it missed.
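To make the backfill idea concrete, here is a minimal sketch in plain Python, not Airflow's own implementation. The function name `missed_slots` and its arguments are hypothetical; it assumes each run covers one fixed-length slot and that a slot is only runnable once it has fully elapsed.

```python
from datetime import datetime, timedelta


def missed_slots(last_success, now, interval):
    """Return the start times of the slots that were missed.

    last_success is the start of the last slot that was processed.
    Hypothetical helper for illustration only.
    """
    slots = []
    slot = last_success + interval
    # A slot qualifies only when its whole window lies in the past.
    while slot + interval <= now:
        slots.append(slot)
        slot += interval
    return slots


if __name__ == "__main__":
    # An hourly feed that last ran for the 00:00 slot and was then down for a while.
    for start in missed_slots(
        datetime(2020, 1, 1, 0, 0),
        datetime(2020, 1, 1, 4, 30),
        timedelta(hours=1),
    ):
        # Each rerun is handed its slot start, so the tool knows which window to fetch.
        print("rerun for slot starting", start.isoformat())
```

This is why the tool being run has to "understand time": each rerun is given the slot it stands for rather than the current clock time, so it can fetch the right window of data.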

Comes with a nice UI, too.

1 comments

I was glancing at the documentation and came across this line:

“Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)”

Many of our tools take on the order of 5-200 GB of data and do some transformation (whose output gets passed along to the next tool at a similar size, possibly after another automated validation step) and/or validation (which can end that particular branch of the workflow).

The automated modules we have are self-contained; each task in our case is “data + config parameters in, data out”. The “data out” of one step then becomes the “data in” of the next, once the configuration parameters for that step have been chosen.

Would this still be a good use case, or am I misunderstanding what the above quote is about?

Airflow does not do the work itself. You write tasks in Python, so you _could_ make it do the processing, but that would be the wrong way forward for large volumes of data if time is of the essence. It merely calls out to whatever does the work, such as other tools that do the processing.

One example is perhaps a small Python script (run by Airflow) that pulls the files you need to process and passes them to a downstream task that runs a shell script, whose output is in turn picked up by something else entirely.
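That pattern can be sketched in plain Python, without Airflow itself. All function and file names here are hypothetical, and a child Python process stands in for the shell script or external tool. The point is that the orchestrating code only hands file paths and config values from step to step, while the data stays on disk and the heavy lifting happens in the external process.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional

# Stand-in for the external tool: sorts the lines of one file into another.
TOOL = (
    "import sys, pathlib; src, dst, rev = sys.argv[1:4]; "
    "lines = sorted(pathlib.Path(src).read_text().splitlines(), reverse=(rev == '1')); "
    "pathlib.Path(dst).write_text('\\n'.join(lines) + '\\n')"
)


def pull_files(workdir: Path) -> Path:
    """Step 1 (hypothetical): fetch the input and return only its path."""
    raw = workdir / "raw.txt"
    raw.write_text("b\na\nc\n")  # placeholder for a real download
    return raw


def transform(data_in: Path, data_out: Path, reverse: bool) -> Path:
    """Step 2: 'data + config in, data out', done by an external process."""
    flag = "1" if reverse else "0"
    subprocess.run(
        [sys.executable, "-c", TOOL, str(data_in), str(data_out), flag],
        check=True,  # a non-zero exit fails the step
    )
    return data_out


def validate(data: Path) -> bool:
    """Step 3: cheap check; a False result ends this branch."""
    return data.stat().st_size > 0


def run_pipeline(workdir: Path) -> Optional[Path]:
    raw = pull_files(workdir)
    out = transform(raw, workdir / "sorted.txt", reverse=False)
    return out if validate(out) else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp)))
```

In a scheduler such as Airflow, each of these functions would presumably become its own task, with only the paths exchanged between them as metadata.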