by firecraker 1039 days ago
So my question to the non-bioinformatics folks: is this already a solved problem?

You have tasks that require resources based on the input parameters; these are run in Docker containers to ensure a consistent environment, and you want to track the output of each step. Often these are embarrassingly parallel operations (e.g. I have 200 samples to do the same thing on).
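As a rough illustration of the kind of workload described above, here is a minimal Python sketch that derives per-task resources from input size and builds one `docker run` command per sample. The sizing rule, the image name `example/aligner:1.0`, the `align` subcommand and the sample names are all hypothetical; only `--rm`, `--cpus` and `--memory` are real `docker run` flags. It only builds the commands and does not execute them.

```python
import shlex


def resources_for(input_bytes):
    """Hypothetical sizing rule: 1 CPU per started GiB of input (max 8), 2 GiB RAM per CPU."""
    gib = max(1, -(-input_bytes // 2**30))  # ceiling division
    cpus = min(gib, 8)
    return {"cpus": cpus, "memory_gib": 2 * cpus}


def docker_command(image, sample, input_bytes, tool_args):
    """Build (but do not run) a `docker run` command for one sample."""
    res = resources_for(input_bytes)
    return [
        "docker", "run", "--rm",
        "--cpus", str(res["cpus"]),
        "--memory", f"{res['memory_gib']}g",
        image, *tool_args, sample,
    ]


# 200 made-up samples with made-up sizes between 1 and 4 GiB.
samples = {f"sample_{i:03d}.fastq": (i % 4 + 1) * 2**30 for i in range(200)}

commands = [
    docker_command("example/aligner:1.0", name, size, ["align"])
    for name, size in samples.items()
]

print(len(commands))
print(shlex.join(commands[0]))
```

Each command is independent of the others, which is what makes the fan-out embarrassingly parallel; the part a workflow manager adds on top is scheduling these and tracking which outputs each one produced.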

Something like Dask perhaps, but where you can specify a Docker image for the task?

What is the go-to in DevOps for similar tasks? GitHub Actions comes pretty close...

To the bioinformatics folks: what is the unique selling point of Nextflow over, say, WDL/Cromwell?

2 comments

I do computational physics and I use Snakemake. On HPC clusters, we only have user-level access. We are not allowed to run long-running processes on login nodes; they can be killed at any time for violating the rules. That said, anything that depends on Docker is a no-go, and anything with a client-server structure we try to avoid (although it might be possible for us to host a daemon elsewhere, which we as students would pay for out of our own pocket). We also deal with a lot of tools that are not written in Python or other modern languages, and you wouldn't want to build any CFFI bindings onto them.

So Snakemake, and similarly Nextflow, suits our needs well. It is a user-space CLI tool that does not require any privileges, and it is optimized for running bash commands or any CLI-based tools. A bonus for Snakemake is that it uses Python, and our other scripts use Python too.

So I guess DevOps tooling, which is heavily biased towards Docker or other container-based execution, is really a different space.

The big difference when comparing bioinformatics systems with non-bioinformatics ones is what the typical payload of a DAG node is and what optimizations that implies. Most other domains don’t have DAG nodes that assume the payload is a crappy command-line call that expects inputs/outputs to magically be in specific places on a POSIX file system.

You can do this on other systems but it’s nice to have the headache abstracted away for you.

The other major difference is the assumed lifecycle. In most biz domains you don’t have researchers iterating on these things the way you do in bioinf. The newer ML/DS systems do solve this problem better than, say, Airflow.

I for one have started to appreciate the fact that the shell/commandline interface means:

- We have an interface that very strongly imposes composability, which is rarely seen in other parts of IT, and makes people actually "follow the rules" :D

- Data is (mostly) treated as immutable, except perhaps inside tools

- Data is cached

- The CLI boundaries mean that one can at least inspect inputs/outputs as a way to debug.

- Etc...

Personally, the biggest frustration is all the inconsistencies in how people design command-line interfaces, primarily that output filenames are so often created based on non-obvious and sometimes arbitrary rules rather than being specified by the user. If all filenames were specified (or at least possible to specify) via the CLI, pipeline managers would have an enormously easier time.

What happens now is that you basically need a mechanism like the one Nextflow has, where all commands are executed in a temp directory and the pipeline tool just globs up all the generated files afterwards. This works, but it opens up a lot of possibilities for mistakes in how files are tracked (a file might be routed to the wrong downstream output if you do something funny with the naming, such that two output path patterns overlap).
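To make the overlap problem concrete, here is a minimal Python sketch of the general "run in a scratch directory, then glob" idea. It is not how Nextflow is implemented; the function names, the output names `report` and `stats`, and the stand-in tool are all made up for illustration.

```python
import fnmatch
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_collect(command, output_patterns):
    """Run `command` in a fresh scratch directory, then match the files it
    left behind against named glob patterns. Only file names are returned;
    a real tool would stage the files somewhere before cleanup."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(command, cwd=workdir, check=True)
        produced = sorted(p.name for p in Path(workdir).iterdir() if p.is_file())
    return {
        name: [f for f in produced if fnmatch.fnmatch(f, pattern)]
        for name, pattern in output_patterns.items()
    }


def find_overlaps(collected):
    """Return the files that were claimed by more than one output name."""
    owners = {}
    for name, files in collected.items():
        for f in files:
            owners.setdefault(f, []).append(name)
    return {f: names for f, names in owners.items() if len(names) > 1}


# Stand-in for a tool that chooses its own output file names.
tool = [
    sys.executable, "-c",
    "open('sample.txt', 'w').close(); open('sample.stats.txt', 'w').close()",
]

collected = run_and_collect(tool, {"report": "*.txt", "stats": "*.stats.txt"})
overlaps = find_overlaps(collected)
print(collected)
print(overlaps)
```

Here `sample.stats.txt` matches both patterns, so it ends up in the `report` output as well as in `stats`. If the tool accepted explicit output paths on its command line, no pattern matching would be needed at all.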

Nextflow can't even get this right: base Nextflow uses some combination of `--paramName` and `--param-name` and treats them as interchangeable, while nf-core encourages `--param_name` (which Nextflow sees as different). All trivial differences, but they just add layers to the CLI frustration train.
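A small Python sketch of the behavior as described in this comment, assuming kebab-case is folded into camelCase while underscores are left untouched. This is only a model of the description above, not Nextflow's actual code, and `canonical_param` is a made-up helper name.

```python
import re


def kebab_to_camel(name):
    """Turn 'param-name' into 'paramName'; underscores are left as-is."""
    return re.sub(r"-([a-z0-9])", lambda m: m.group(1).upper(), name)


def canonical_param(flag):
    """Strip the leading dashes from a CLI flag and fold kebab-case to camelCase."""
    return kebab_to_camel(flag.lstrip("-"))


for flag in ["--paramName", "--param-name", "--param_name"]:
    print(flag, "->", canonical_param(flag))
```

Under this model the first two flags collapse to the same key, `paramName`, while `--param_name` stays a separate parameter, which is exactly the kind of silent mismatch that is easy to miss.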