| HN Mirror


by hhas01 2410 days ago

This is a serious question, BTW. “Lots of people are using it” is testament neither to sound architecture nor practical need. Lots of people have joined MLMs; does that mean MLMs are good and needed? Or does it just mean lots of people are easily seduced by layers and layers of makework and grift?

The only bit that sounds at all novel or interesting is the dispatcher; and even that is really just an expression of `<load_balancer> | ssh`. At which point, Unix Philosophy tells us we should implement <load_balancer> as a small simple single-purpose Unix command which can easily pipe to other Unix tools. Itch scratched; everyone can now go get on with their actual work.
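
To make the `<load_balancer> | ssh` idea concrete, here is a minimal sketch in Python of what such a single-purpose dispatcher might look like. It assumes simple round-robin assignment, and the function names and host names are hypothetical, made up for illustration. It only builds the `ssh` command lines and does not execute anything.

```python
import itertools

def assign_round_robin(commands, hosts):
    """Pair each command with a host, cycling through the hosts in order."""
    if not hosts:
        raise ValueError("need at least one host")
    host_cycle = itertools.cycle(hosts)
    return [(next(host_cycle), command) for command in commands]

def to_ssh_argv(host, command):
    """Build the argv that would run `command` on `host` via ssh (not run here)."""
    return ["ssh", host, command]

if __name__ == "__main__":
    # Hypothetical hosts and commands, for illustration only.
    jobs = assign_round_robin(
        ["gzip a.txt", "gzip b.txt", "gzip c.txt"],
        ["node1", "node2"],
    )
    for host, command in jobs:
        print(" ".join(to_ssh_argv(host, command)))
```

Round-robin is just the simplest policy to show; a real dispatcher would presumably also need to track load, failures, and data location, which this sketch ignores.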

So if that is the case, then precisely what problem is all CWL’s castles-in-the-sky YAML crap actually solving, other than bored developers’ need to keep entertained? Especially when [from what I can tell] the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own.

How many wheels need to be reimplemented before someone involved declares it a pig in a poke? And how many more before the rest can accept this?

2 comments

tetron 2408 days ago

I don't know if I can change your mind, or if anyone else is reading this thread, but CWL was designed to solve a particular set of problems. If you don't have those problems, you might not need it, but that doesn't mean those problems don't exist.

> the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own.

Close...

The software that supports CWL comes from SaaS vendors, FOSS projects, and various HPC schedulers, all of which have their own incompatible data management and dispatch/scheduling systems. If you want to write an analysis that runs on more than one of these platforms, you need some abstraction for it. CWL is one such abstraction.

This matters because maybe you've developed a research pipeline that integrates a bunch of different tools written in different languages and want to run it on somebody else's data, and you need to run it on their infrastructure because copying 12 terabytes of HIPAA-restricted data from their LSF cluster to your Google cloud instance isn't an option.

"Just use bash" is what people who adopt CWL are trying to get away from. It is nearly impossible to write portable parallel / distributed analysis in bash, and the result is brittle scripts with more coordination code than code that actually does scientific work. Because CWL is declarative, the CWL engine handles all the coordination, scheduling and data staging for your particular infrastructure.

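To illustrate what "declarative" buys you, here is a rough sketch in Python. It is not CWL syntax and not how any particular CWL engine is implemented; the step names and the `execution_batches` helper are hypothetical. The workflow only states which steps depend on which, and a separate piece of code works out what can run, and what can run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step lists the steps whose outputs it needs.
workflow = {
    "align": [],
    "sort": ["align"],
    "call_variants": ["sort"],
    "qc_report": ["align"],
}

def execution_batches(steps):
    """Group steps into batches; every step in a batch can run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches

if __name__ == "__main__":
    for batch in execution_batches(workflow):
        print(batch)
```

The person writing `workflow` never says where or in what order things run. An engine for a different platform could take the same description and schedule it its own way, which is the portability point.
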
You may not have any of these needs, but suggesting that we're just bored developers creating castles in the sky is really unhelpful.

boohooimsad 2409 days ago

I'll play. CWL was designed for bioinformatics research first. We're using CWL for bioinformatics analysis, because as the "scientific workflow | data pipeline" grows (beyond 5-10 tools), bundling the execution and logical analysis together becomes difficult. If we can let researchers write just their analysis (which tools to run and what their dependencies are), and abstract the execution environment, we can create more structured analysis that's portable and publishable, and also often quicker to run.

Bioinformatics software isn't perfectly written software; there are a number of weird behaviours that a simple Unix pipe doesn't solve. There are engines that support CWL, and other existing engines have been adding CWL support.

I'm not saying that there aren't other frameworks out there for doing analysis, or that this is the best way, but this is an option that IS working for researchers.

Edit: workflow -> scientific workflow | data pipeline