by tetron 2405 days ago
CWL is a declarative/functional language for describing how to execute command line tools (staging input files, mapping arguments, collecting outputs) and how to connect the outputs of one tool to the inputs of the next. It is HPC and cloud agnostic. The same workflow description can run on a laptop or on 1000 cloud compute nodes. Lots of people are already using it to solve their problems; for some examples, see https://github.com/search?q=extension%3Acwl+cwlVersion
1 comments

Quickly skimming the CWL User Guide, calling that a “language” is a huge stretch. Looks like writing a bunch of YAML config files which let you write more YAML files to construct a UNIX pipeline and run it.

In which case, why didn’t you just use bash?

Yeah, I know bash stinks. But this is not my first rodeo* so I’m struggling here to see how CWL stinks less. See again: Inner-Platform Effect, Greenspun’s Tenth Rule.

--

* i.e. I already know how easily “workflow engines” go up their own arse because I’ve done it myself. And CWL fails the same sniff test. Not a good start.

This is a serious question, BTW. “Lots of people are using it” is testament neither to sound architecture nor practical need. Lots of people have joined MLMs; does that mean MLMs are good and needed? Or does it just mean lots of people are easily seduced by layers and layers of makework and grift?

The only bit that sounds at all novel or interesting is the dispatcher; and even that is really just an expression of `<load_balancer> | ssh`. At which point, Unix Philosophy tells us we should implement <load_balancer> as a small simple single-purpose Unix command which can easily pipe to other Unix tools. Itch scratched; everyone can now go get on with their actual work.

So if that is the case, then precisely what problem is all CWL’s Castles-in-the-sky YAML crap actually solving, other than bored developers’ need to keep entertained? Especially when [from what I can tell] the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own.

How many wheels need to be reimplemented before someone involved declares it a pig in a poke? And how many more before the rest can accept this?

I don't know if I can change your mind, or if anyone else is reading this thread, but CWL was designed to solve a particular set of problems. If you don't have those problems, you might not need it, but that doesn't mean those problems don't exist.

> the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own.

Close...

The software that supports CWL includes offerings from SaaS vendors, FOSS projects, and various HPC schedulers, all of which have their own incompatible data management and dispatch/scheduling systems. If you want to write an analysis that runs on more than one of these platforms, you need some abstraction for it. CWL is one such abstraction.

This matters because maybe you've developed a research pipeline that integrates a bunch of different tools written in different languages and want to run it on somebody else's data, and you need to run it on their infrastructure because copying 12 terabytes of HIPAA-restricted data from their LSF cluster to your Google cloud instance isn't an option.

"Just use bash" is what people who adopt CWL are trying to get away from. It is nearly impossible to write portable parallel / distributed analysis in bash, and the result is brittle scripts with more coordination code than code that actually does scientific work. Because CWL is declarative, the CWL engine handles all the coordination, scheduling and data staging for your particular infrastructure.
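To make "declarative" a bit more concrete, here is a rough Python sketch of the idea. It is not CWL syntax and not how any real CWL engine is implemented; the step names and file names are made up for illustration. The description only states what each step consumes and produces, and a tiny planner works out the ordering and which steps could run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical declarative description (all names are illustrative).
# Each step only lists the files it reads and writes. The order of the
# entries is deliberately shuffled: nothing here says when to run what.
STEPS = {
    "call":  {"inputs": ["sorted.bam", "sorted.bam.bai"], "outputs": ["variants.vcf"]},
    "stats": {"inputs": ["sorted.bam"], "outputs": ["stats.txt"]},
    "align": {"inputs": ["reads.fq"], "outputs": ["aligned.bam"]},
    "index": {"inputs": ["sorted.bam"], "outputs": ["sorted.bam.bai"]},
    "sort":  {"inputs": ["aligned.bam"], "outputs": ["sorted.bam"]},
}


def plan(steps):
    """Return batches of step names in dependency order.

    Steps within one batch do not depend on each other, so an engine
    would be free to run them in parallel on whatever backend it has.
    """
    # Map every produced file to the step that produces it.
    producer = {out: name for name, step in steps.items() for out in step["outputs"]}
    # A step depends on the producers of its inputs; inputs that no step
    # produces (e.g. reads.fq) are treated as externally supplied files.
    graph = {
        name: {producer[f] for f in step["inputs"] if f in producer}
        for name, step in steps.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(plan(STEPS), start=1):
        print(number, batch)
```

In bash, that ordering and parallelism would have to be written out by hand and rewritten for each cluster or cloud; here it falls out of the data dependencies, which is the part an engine can take over.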

You may not have any of these needs, but suggesting that we're just bored developers creating castles in the sky is really unhelpful.

I'll play. CWL was designed for bioinformatics research first. We're using CWL for bioinformatics analysis, because as the "scientific workflow | data pipeline" grows (beyond 5-10 tools), bundling the execution and logical analysis together becomes difficult. If we can let researchers write just their analysis (which tools to run and what their dependencies are), and abstract away the execution environment, we can create more structured analyses that are portable and publishable, and also often quicker to run.

Bioinformatics software isn't perfectly written; there are a number of weird behaviours that a simple Unix pipe doesn't handle. There are engines that support CWL, and other existing engines have been adding CWL support.

I'm not saying that there aren't other frameworks out there for doing analysis, or that this is the best way but this is an option that IS working for researchers.

Edit: workflow -> scientific workflow | data pipeline