| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mtrn 3447 days ago

I've worked with and looked at a lot of data processing helpers. Tools, that try to help you build data pipelines, for the sake of performance, reproducibility or simply code uniformity.

What I found so far: Most tools, that invent a new language or try to cram complex processes into lesser suited syntactical environments are not loved too much.

A few people like XSLT, most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time, where the answer to everything was XML. There are big data orchestration frameworks, that use an XML as configuration language, which can be ok, if you have clear processing steps.

Every time a tool invents a DSL for data processing, I grab my list of ugly real world use cases and most of the tools fail soon, if not immediately. That's a pity.

Programming languages can be effective as they are, and with the exceptions that unclean data brings, you want to have a programming language at your disposal anyway.

I'll give dgsh a try. The tool reuse approach and the UNIX spirit seems nice. But my initial impression of the "C code metrics" example from the site is mixed: It reminds me of awk, about which one of the authors said, that it's a beautiful language, but if your programs getting longer than hundred lines, you might want to switch to something else.

Two libraries which have a great grip at the plumbing aspect of data processing systems are airflow and luigi. They are python libraries and with it you have a concise syntax and basically all python libraries plus non-python tools with a command line interface at you fingertips.

I am curious, what kind of process orchestration tools people use and can recommend?

7 comments

samuell 3446 days ago

Exactly our experience too, from complex machine learning workflows in various aspects of drug discovery.

We basically did not really find any of the popular DSL-based bioinformatics pipeline tools (snakemake, bpipe etc) to fit the bill. Nextflow came close, but in fact allows quite some custom code too.

What worked for us was to use Spotify's Luigi, which is a python library rather than DSL.

The only thing was that we had to develop a flow-based inspired API on top of Luigi's more functional programming based one, in order to make defining dependencies fluent and easy enough to specify for our complex workflows.

Our flow-based inspired Luigi API (SciLuigi) for complex workflows, is available at:

https://github.com/pharmbio/sciluigi

We wrote up a paper on it as well, detailing a lot of the design decisions behind it:

http://dx.doi.org/10.1186/s13321-016-0179-6

Then, lately we are working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm, we could just as well just rely on the Go channels and go-routines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has 3 times less LOC than a recent bioinformatics pipeline tool written in python and put into production). Code available at:

https://github.com/scipipe/scipipe

It is also very much a programming library rather than a DSL.

It in fact even implements streaming via named pipes, seemingly allowing somewhat similar operations as dgsh, with a bit more code probably, but with the (seeming) benefit of a bit easier handling of multiple inputs and outputs (via the flow-based progr. ports concept).

dgsh looks real interesting for simpler operations where there is one main input and output though - which occur a lot for ad-hoc work in the shell, in our experience. Will have to test it out for sure!