by dwhitena 3324 days ago

I will also throw in my suggestion (biased, as I work on the project) to take a look at Pachyderm (http://pachyderm.io/). It is open source, language agnostic, and distributed. It also automatically tracks the provenance of all of your data pipelines, regardless of the language you use or how you parallelize over your data.

Basically, you set up data pipelines where the input and output of each stage are versioned (like "git for data"). That way you have versioned sets of your data (e.g., training data), and you can also tell exactly which model produced which result, what data was used to train that model, what transformations were applied to that training set, etc.
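
To make the "git for data" idea concrete, here is a toy sketch in plain Python. It is not Pachyderm's API; the `commit` and `provenance` helpers and the in-memory `store` dict are hypothetical names made up for illustration. Each stage's output is stored under a content hash together with the IDs of the inputs it was derived from, so you can walk back from a model to the data behind it.

```python
import hashlib
import json


def commit(store, data, parents=(), transform=None):
    """Store data under a content hash and record what produced it.

    Hypothetical helper for illustration; not part of any real library.
    """
    record = {"data": data, "parents": list(parents), "transform": transform}
    payload = json.dumps(record, sort_keys=True).encode()
    commit_id = hashlib.sha256(payload).hexdigest()[:12]
    store[commit_id] = record
    return commit_id


def provenance(store, commit_id):
    """Walk back through parents and list every commit that contributed."""
    seen, stack = [], [commit_id]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.append(cid)
        stack.extend(store[cid]["parents"])
    return seen


store = {}

# Stage 0: raw input data.
raw = commit(store, [1, 2, 3, 4])

# Stage 1: a transformation producing the training set.
train_data = [x * 2 for x in store[raw]["data"]]
train = commit(store, train_data, parents=[raw], transform="double")

# Stage 2: a trivial "model" fitted on the training set.
fitted = {"mean": sum(train_data) / len(train_data)}
model = commit(store, fitted, parents=[train], transform="fit_mean")

# Which data and transformations led to this model?
for cid in provenance(store, model):
    print(cid, store[cid]["transform"])
```

Because the ID is derived from the content and its parents, changing the raw data gives every downstream stage a new ID, which is roughly what lets you say which version of the training set a given model came from.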

Things like Airflow and Luigi are, no doubt, useful for data pipelining and some workflows (depending on what language you are working with). However, by combining pipelining and data versioning in a unified way, Pachyderm naturally lets you track the provenance of complicated pipelines, reproduce results exactly, and even do interesting things like incremental processing.
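
As a rough illustration of why versioned inputs make incremental processing possible, here is another toy Python sketch. Again, this is not how Pachyderm is implemented; `digest`, `run_incremental`, and the `cache` dict are hypothetical names. The assumption is simply that if you know the hash of each input, you only need to re-run a stage on inputs whose hash you have not seen before.

```python
import hashlib


def digest(content: str) -> str:
    """Content hash used as the version of one input."""
    return hashlib.sha256(content.encode()).hexdigest()


def run_incremental(inputs, cache, transform):
    """Apply transform to each input, reusing results for unchanged content.

    Hypothetical helper for illustration. Returns the outputs plus the
    names of the inputs that actually had to be processed.
    """
    outputs, recomputed = {}, []
    for name, content in inputs.items():
        key = digest(content)
        if key not in cache:
            cache[key] = transform(content)
            recomputed.append(name)
        outputs[name] = cache[key]
    return outputs, recomputed


cache = {}

# First run: everything is new, so everything is processed.
v1 = {"a.txt": "hello", "b.txt": "world"}
out1, done1 = run_incremental(v1, cache, str.upper)

# Second run: one file was added, the other two are unchanged.
v2 = {"a.txt": "hello", "b.txt": "world", "c.txt": "again"}
out2, done2 = run_incremental(v2, cache, str.upper)

print(done1)  # inputs processed in the first run
print(done2)  # inputs processed in the second run
```

In the second run only the new file is processed; the results for the unchanged files come straight from the cache.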