Hacker News
by rockmeamedee 2899 days ago
Make is often brought out for data, "single machine ETL" jobs, but for big, complicated (and iterative) workflows it doesn't feel good enough to me.

What do you folks use? Drake ("make for data", https://github.com/Factual/drake) seems OK, but it doesn't have "batch" jobs (aka "pattern rules"), where you can process every file in a directory matching a pattern.

Others have come up with different Swiss Army knives, but nothing ever sticks for me; it usually ends up as a single Makefile with, e.g., 3 targets that call a bunch of shell scripts.

The whole thing would be configurable to build from scratch, but not well set up to do incremental ETL on a per-file basis after I, e.g., delete some extraneous rows in one file, clean up a column, redownload a folder, or add files to a dataset.
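To make the per-file incremental idea concrete, here is a minimal sketch of the timestamp check that make-style tools apply to each file: a target is rebuilt only when it is missing or older than its source. The directory layout, the `*.csv` pattern, and the names `is_stale` and `stale_pairs` are assumptions for illustration and do not come from any tool mentioned in this thread.

```python
from pathlib import Path


def is_stale(source: Path, target: Path) -> bool:
    """Return True when target is missing or older than source."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def stale_pairs(raw_dir: Path, out_dir: Path, pattern: str = "*.csv"):
    """List (source, target) pairs that need rebuilding.

    Every file in raw_dir matching `pattern` maps to a file with the
    same name in out_dir (a hypothetical layout).
    """
    pairs = []
    for source in sorted(raw_dir.glob(pattern)):
        target = out_dir / source.name
        if is_stale(source, target):
            pairs.append((source, target))
    return pairs
```

Editing one raw file, or adding a new one, would then put only that file back on the list, while untouched files are skipped. Note that this only compares modification times, so it will not notice a changed processing script unless that script is also treated as a source.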

4 comments

I use Snakemake [1], a parallel make system for data, designed around pattern-matching rules. The rules are either shell commands or Python 3 code.

I settled on it after originally using make, getting frustrated with the crazy work-arounds I needed to implement because it doesn't understand build steps with multiple outputs, switching to Ninja where you have to construct the dependency tree yourself, and finally ending up on Snakemake which does everything I need.

[1] https://snakemake.readthedocs.io/en/stable/

Thank you for sharing this information about Snakemake. I administer a cluster for a group of geneticists. I'll try to get them to use it for their publications to make their results easily reproducible by others.
Just today, I used xargs instead of spending a lot of time building a batching script in Python. I wanted to launch a bunch of processes in a queue but only execute 10 of them in parallel at any time.

Here is a skeleton of what I came up with.

    find "$(pwd)" -mindepth 1 -maxdepth 1 -type d -name ".zfs" -prune -o -type d -print0 | xargs -0 -P 2 -I {} echo {}
where:

$(pwd) is the starting point for the directory listing

-mindepth 1 makes sure the starting directory itself is not listed

-maxdepth 1 keeps the listing from recursing into subdirectories

-type d -name ".zfs" -prune matches directories named .zfs (snapshot directories) and skips them

-o -type d matches every other directory

-print0 separates results with a NUL character instead of a newline; plain -print prints one result per line

xargs -0 reads NUL-separated input, so spaces or newlines in directory names are handled safely

-P 2 runs up to two processes at once in parallel

-I {} replaces {} in the subsequent command with each item piped into xargs, so echo {} becomes echo dir1, then echo dir2, etc.

That's just an example to show that we can do a lot with standard Unix tools before bringing in the external sophistication for data related tasks.
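For comparison, the batching script that the xargs one-liner replaces might look roughly like the sketch below, which uses only the Python standard library. It is an approximation, not an exact port: the names `list_dirs` and `run_all` are made up for illustration, and the default `echo` command assumes an `echo` executable is on the PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_dirs(root: Path, skip: str = ".zfs"):
    """Immediate subdirectories of root, excluding the one named `skip`."""
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and not p.is_symlink() and p.name != skip
    )


def run_all(dirs, command=("echo",), max_parallel=2):
    """Run `command <dir>` once per directory, at most max_parallel at a time.

    Returns each process's stdout, in the same order as `dirs`.
    """
    def run_one(directory):
        result = subprocess.run(
            [*command, str(directory)],
            capture_output=True,
            text=True,
            check=True,  # raise if the command exits with a non-zero status
        )
        return result.stdout

    # The pool queues the remaining directories while max_parallel workers are busy.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, dirs))
```

Calling `run_all(list_dirs(Path.cwd()), max_parallel=10)` would correspond to the "10 at a time" case described above. The xargs version is still much shorter, which is the point the comment makes.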

And with GNU parallel, which can take the place of xargs, you can even distribute that job across multiple machines easily (as long as they're accessible by SSH).
Yes, I need to look into whether and how GNU Parallel will queue up tasks if I restrict the number of parallel processes.

In my case, I was dealing with a FreeBSD server. I went the xargs route instead of installing something that is not available by default.

Someone on Twitter mentioned Luigi, which was previously developed and maintained by Spotify, as a distributed Make written with Python: https://github.com/spotify/luigi

Not sure if Spotify still uses it, but it is in their GitHub org.

Luigi is great, although I don't think it's easy to add "rerun if source file updated". Would love to be wrong on that.

http://pachyderm.io seems great but does require more engineering support (it needs a Kubernetes cluster).

I'm a fan of Apache Airflow for large, complicated ETL processes, especially those with depth and breadth in their dependencies.