
Pachyderm, with its "git for big data" approach, is one of the coolest things, if not THE coolest thing, I learned about in 2016.

I just hope I get time to test it in more depth sooner rather than later (it is one of my top goals for 2017).

Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here. Instead, each pipeline declares its inputs and outputs separately, so dependencies are expressed in terms of data.
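To make the "dependencies between data rather than tasks" idea concrete, here is a minimal Python sketch. It is not Pachyderm's or scipipe's API; the `Step` class, the `execution_order` function, and the file names are all hypothetical, made up for illustration. Each step only declares the data it consumes and produces, and the run order is derived from that.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical workflow step that declares named inputs and outputs."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def execution_order(steps):
    """Order steps so each one runs after the steps producing its input data."""
    # Map every piece of data to the step that produces it.
    producer = {out: s.name for s in steps for out in s.outputs}
    by_name = {s.name: s for s in steps}
    ordered, done = [], set()

    def visit(step, stack=()):
        if step.name in done:
            return
        if step.name in stack:
            raise ValueError(f"cycle through {step.name}")
        for data in step.inputs:
            # Data without a producer is treated as an external input.
            if data in producer:
                visit(by_name[producer[data]], stack + (step.name,))
        done.add(step.name)
        ordered.append(step.name)

    for s in steps:
        visit(s)
    return ordered


# Steps are listed out of order on purpose; no step names another step.
steps = [
    Step("report", inputs=["aligned.bam", "variants.vcf"], outputs=["report.html"]),
    Step("call", inputs=["aligned.bam"], outputs=["variants.vcf"]),
    Step("align", inputs=["reads.fastq"], outputs=["aligned.bam"]),
]
print(execution_order(steps))
```

Note that "report" never says "run after call"; it only says it needs `variants.vcf`, and the ordering falls out of who produces that data.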

Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much interested to see whether it can natively fit the bill for our complex workflows. If not, I think we can always use it in a lightweight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.

That way we would still benefit from the seemingly groundbreaking "git for big data" paradigm and from workflows that are automatically re-executed when data is updated, which should enable something as impactful as on-line data analyses (auto-updated upon new data) in a manageable way.
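As a rough illustration of what "auto-updated upon new data" could mean, here is a small Python sketch that re-runs an analysis only when the content of the committed data changes. This is an assumption-laden toy, not how Pachyderm implements it: the `AutoUpdatingAnalysis` class and its `commit` method are hypothetical names, and a SHA-256 content hash stands in for real data versioning.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used as a stand-in for a data version identifier."""
    return hashlib.sha256(data).hexdigest()


class AutoUpdatingAnalysis:
    """Hypothetical wrapper that re-runs `analyse` only when the data changes."""

    def __init__(self, analyse):
        self.analyse = analyse
        self.last_fingerprint = None
        self.result = None
        self.runs = 0  # number of times the analysis actually executed

    def commit(self, data: bytes):
        fp = fingerprint(data)
        if fp != self.last_fingerprint:
            # New or changed data: recompute and remember the version.
            self.result = self.analyse(data)
            self.last_fingerprint = fp
            self.runs += 1
        return self.result


# Toy analysis: count the lines in the committed data.
analysis = AutoUpdatingAnalysis(lambda d: len(d.splitlines()))
first = analysis.commit(b"a\nb\n")      # runs the analysis
second = analysis.commit(b"a\nb\n")     # identical data, result is reused
third = analysis.commit(b"a\nb\nc\n")   # updated data, analysis runs again
```

The point is only that the trigger is the data itself: committing unchanged data costs nothing, while a changed commit refreshes the result without anyone starting the workflow by hand.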