Make is often brought out for "single machine ETL" data jobs, but for big, complicated (and iterative) workflows it doesn't feel good enough to me. What do you folks use? Drake, "make for data" (https://github.com/Factual/drake), seems OK, but doesn't have "batch" jobs (aka "pattern rules"), where you can process every file in a directory matching a pattern. Others have come up with different Swiss Army knives, but nothing ever sticks for me; it usually ends up as a single Makefile with, e.g., 3 targets that call a bunch of shell scripts. The whole thing can be rebuilt from scratch, but it's not well set up to do incremental ETL on a per-file basis after I, e.g., delete some extraneous rows in one file, clean up a column, redownload a folder, or add files to a dataset.
I settled on Snakemake [1] after originally using make and getting frustrated with the crazy work-arounds I needed to implement because it doesn't understand build steps with multiple outputs, then switching to Ninja, where you have to construct the dependency tree yourself. Snakemake does everything I need.
[1] https://snakemake.readthedocs.io/en/stable/
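
For anyone wondering what "pattern rule plus multiple outputs plus per-file incremental rebuild" boils down to, here is a rough Python sketch of the idea. This is not how Snakemake, make, or Drake implement it; it is just the usual timestamp comparison, and every name in it (`is_stale`, `build_matching`, `clean_outputs`, `clean`) is made up for illustration.

```python
from pathlib import Path


def is_stale(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    outputs = [Path(o) for o in outputs]
    if not all(o.exists() for o in outputs):
        return True
    newest_input = max(Path(i).stat().st_mtime_ns for i in inputs)
    oldest_output = min(o.stat().st_mtime_ns for o in outputs)
    return oldest_output < newest_input


def build_matching(src_dir, out_dir, pattern, outputs_for, run):
    """Apply one step to every file matching `pattern`, skipping up-to-date files.

    `outputs_for(src, out_dir)` lists the files the step produces for `src`;
    `run(src, outputs)` actually writes them. Returns the names it rebuilt.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rebuilt = []
    for src in sorted(src_dir.glob(pattern)):
        outputs = outputs_for(src, out_dir)
        if is_stale([src], outputs):
            run(src, outputs)
            rebuilt.append(src.name)
    return rebuilt


# Hypothetical step with TWO outputs per input: a cleaned file and a row count.
def clean_outputs(src, out_dir):
    return [out_dir / f"{src.stem}.clean.csv", out_dir / f"{src.stem}.rowcount"]


def clean(src, outputs):
    # Drop blank lines, then record how many rows survived.
    rows = [line for line in src.read_text().splitlines() if line.strip()]
    outputs[0].write_text("\n".join(rows) + "\n")
    outputs[1].write_text(f"{len(rows)}\n")
```

Calling `build_matching("raw", "build", "*.csv", clean_outputs, clean)` twice in a row should do work the first time and nothing the second; touching or adding one file in `raw` should rebuild only that file. Real tools add a dependency graph, parallelism, and often content hashes on top of this.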