by samuell 3453 days ago
Exactly our experience too, from complex machine learning workflows in various aspects of drug discovery.

We did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, Bpipe, etc.) to fit the bill. Nextflow came close, and in fact allows quite a lot of custom code too.

What worked for us was Spotify's Luigi, which is a Python library rather than a DSL.

The only catch was that we had to develop an API inspired by flow-based programming on top of Luigi's more functional-programming-style one, in order to make dependencies fluent and easy enough to specify for our complex workflows.

Our flow-based-inspired Luigi API (SciLuigi) for complex workflows is available at:

https://github.com/pharmbio/sciluigi

We wrote up a paper on it as well, detailing a lot of the design decisions behind it:

http://dx.doi.org/10.1186/s13321-016-0179-6

Lately we have also been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could just as well rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has about a third of the lines of code of a recent bioinformatics pipeline tool written in Python and put into production). Code available at:

https://github.com/scipipe/scipipe

It is also very much a programming library rather than a DSL.

SciPipe in fact even implements streaming via named pipes, seemingly allowing operations somewhat similar to those of dgsh, probably with a bit more code, but with the (seeming) benefit of somewhat easier handling of multiple inputs and outputs (via the ports concept from flow-based programming).
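To make the "implicit scheduler" idea a bit more concrete, here is a minimal sketch in plain Go. It is not SciPipe's actual API: the process functions (`upper`, `split`, `collect`) and `RunNetwork` are hypothetical names made up for illustration. Each process is a goroutine, each port is a channel, and the Go runtime decides what runs when, so no explicit scheduler is written.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// upper is a hypothetical process with one in-port and one out-port.
func upper(in <-chan string, out chan<- string) {
	defer close(out)
	for s := range in {
		out <- strings.ToUpper(s)
	}
}

// split is a hypothetical process with one in-port and two out-ports:
// items of at most 3 characters go to short, the rest go to long.
func split(in <-chan string, short, long chan<- string) {
	defer close(short)
	defer close(long)
	for s := range in {
		if len(s) <= 3 {
			short <- s
		} else {
			long <- s
		}
	}
}

// collect drains one port into a slice and signals when it is done.
func collect(in <-chan string, dst *[]string, wg *sync.WaitGroup) {
	defer wg.Done()
	for s := range in {
		*dst = append(*dst, s)
	}
}

// RunNetwork wires the processes together by connecting their ports.
// Blocking channel sends and receives are the only "scheduling" involved.
func RunNetwork(items []string) (short, long []string) {
	src := make(chan string)
	up := make(chan string)
	shortCh := make(chan string)
	longCh := make(chan string)

	go func() {
		defer close(src)
		for _, it := range items {
			src <- it
		}
	}()
	go upper(src, up)
	go split(up, shortCh, longCh)

	var wg sync.WaitGroup
	wg.Add(2)
	go collect(shortCh, &short, &wg)
	go collect(longCh, &long, &wg)
	wg.Wait()
	return short, long
}

func main() {
	short, long := RunNetwork([]string{"dna", "protein", "rna", "ligand"})
	fmt.Println(short, long) // prints: [DNA RNA] [PROTEIN LIGAND]
}
```

Note how the two outputs of `split` are routed to separate downstream consumers simply by handing each one its own channel, which is the multiple inputs/outputs handling referred to above.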

dgsh looks really interesting for simpler operations where there is one main input and output, though, which occur a lot in ad-hoc work in the shell, in our experience. Will have to test it out for sure!


Have you checked out Airflow? Any opinions?

I have looked a bit at Airflow code examples, but was worried that it seems to have the same problem as a lot of other pipeline tools: in the main workflow specification, dependencies are specified only between tasks, not between the individual inputs and outputs of each task (that is, between tasks rather than data).

This means that this info needs to be implemented "manually" in some less declarative manner somewhere else, breaking the declarative-ness of the workflow specification.

I posted about this some time ago here, mentioning Airflow specifically: http://bionics.it/posts/workflows-dataflow-not-task-deps
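A rough sketch of the distinction, in Go. This is not the API of Airflow, Luigi, or SciLuigi: `Task`, `Port`, `Connect`, and the file paths are all hypothetical and only meant to illustrate the point. When connections are declared between individual out-ports and in-ports, the task-level dependencies can be derived from them, whereas the reverse is not possible without extra information kept somewhere else.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical workflow step with named in-ports and out-ports.
type Task struct {
	Name string
	Out  map[string]string // out-port name -> path of the produced file
	In   map[string]*Port  // in-port name -> connected upstream out-port
}

// Port identifies one specific output of one specific task.
type Port struct {
	Task *Task
	Name string
}

func NewTask(name string, out map[string]string) *Task {
	return &Task{Name: name, Out: out, In: map[string]*Port{}}
}

// OutPort returns a handle to one named output of the task.
func (t *Task) OutPort(name string) *Port { return &Port{Task: t, Name: name} }

// Connect declares a data dependency: this in-port reads that out-port.
func (t *Task) Connect(inPort string, from *Port) { t.In[inPort] = from }

// InputPath resolves which file a connected in-port will read.
func (t *Task) InputPath(inPort string) string {
	p := t.In[inPort]
	return p.Task.Out[p.Name]
}

// Upstream derives the task-level dependencies from the port connections.
func (t *Task) Upstream() []string {
	seen := map[string]bool{}
	var names []string
	for _, p := range t.In {
		if !seen[p.Task.Name] {
			seen[p.Task.Name] = true
			names = append(names, p.Task.Name)
		}
	}
	sort.Strings(names)
	return names
}

// BuildExample wires a small, made-up three-task workflow.
func BuildExample() *Task {
	splitData := NewTask("split", map[string]string{
		"train": "data/train.csv",
		"test":  "data/test.csv",
	})
	train := NewTask("train_model", map[string]string{"model": "model.bin"})
	eval := NewTask("evaluate", map[string]string{"report": "report.txt"})

	// Dependencies are stated between data items, not just between tasks.
	train.Connect("data", splitData.OutPort("train"))
	eval.Connect("data", splitData.OutPort("test"))
	eval.Connect("model", train.OutPort("model"))
	return eval
}

func main() {
	eval := BuildExample()
	fmt.Println(eval.InputPath("data")) // prints: data/test.csv
	fmt.Println(eval.Upstream())        // prints: [split train_model]
}
```

With only "evaluate depends on split and train_model", the information that `evaluate` should read the test set rather than the training set would have to live somewhere outside the workflow definition.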

We wrote a package to go with our Airflow installation that borrows some of the data-flow concepts we liked from Make/Drake/Luigi (as opposed to the exclusively task-dependency flow of Airflow that you mention). You may be interested: github.com/industrydive/fileflow

That's nice! I didn't know Airflow did in-memory passing (as I now understand it does?), so I can see that this must be needed for larger data items, right?

Does it also make it easier to route multiple individual outputs to separate downstream components, etc.?

thanks!