by dwhitena
3446 days ago
Thanks for sharing your experience. I work with Pachyderm, which is an open source data pipelining and data versioning framework. Some things that might be relevant to this conversation are that Pachyderm is language agnostic and that it keeps analyses in sync with data (because pipelines are triggered by commits to versioned data). This makes it distinct from Airflow or Luigi, for example.
|
I just hope to find time to test it out in more depth sooner rather than later (it is one of my top goals for 2017).
Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here; it properly identifies separate inputs and outputs declaratively.
AFAIK, Pachyderm specifies workflows in a kind of DSL, and I'm very much interested to see if it could natively fit the bill for our complex workflows. But if not, I think we can always use it in a lightweight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.
We would still like to benefit from the seemingly groundbreaking "git for big data" paradigm and from workflows that are executed automatically on updated data, which should enable something as impactful as on-line data analyses (auto-updated upon new data) in a manageable way.