by hhas01
2410 days ago
This is a serious question, BTW. “Lots of people are using it” is testament neither to sound architecture nor practical need. Lots of people have joined MLMs; does that mean MLMs are good and needed? Or does it just mean lots of people are easily seduced by layers and layers of makework and grift? The only bit that sounds at all novel or interesting is the dispatcher; and even that is really just an expression of `<load_balancer> | ssh`. At which point, Unix Philosophy tells us we should implement <load_balancer> as a small, simple, single-purpose Unix command which can easily pipe to other Unix tools. Itch scratched; everyone can now go get on with their actual work. So if that is the case, then precisely what problem is all CWL’s castles-in-the-sky YAML crap actually solving, other than bored developers’ need to keep entertained? Especially when [from what I can tell] the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own. How many wheels need to be reimplemented before someone involved declares it a pig in a poke? And how many more before the rest can accept this?
> the project doesn’t even provide you a dispatcher component but instead tells everyone to take a spec and write their own.
Close...
The software that supports CWL comes from SaaS vendors, FOSS projects, and various HPC schedulers, all of which have their own incompatible data management and dispatch/scheduling systems. If you want to write an analysis that runs on more than one of these platforms, you need some abstraction for it. CWL is one such abstraction.
This matters because maybe you've developed a research pipeline that integrates a bunch of different tools written in different languages and want to run it on somebody else's data, and you need to run it on their infrastructure because copying 12 terabytes of HIPAA-restricted data from their LSF cluster to your Google cloud instance isn't an option.
"Just use bash" is what people who adopt CWL are trying to get away from. It is nearly impossible to write portable parallel / distributed analysis in bash, and the result is brittle scripts with more coordination code than code that actually does scientific work. Because CWL is declarative, the CWL engine handles all the coordination, scheduling and data staging for your particular infrastructure.
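For a sense of what "declarative" means here, below is a minimal sketch of a CWL tool description, along the lines of the introductory example in the CWL user guide. The file name `echo.cwl` and the input name `message` are only illustrative. Note that it says what the tool is and what it takes, and nothing about where or how it runs:

```yaml
# echo.cwl (hypothetical file name): describes a tool, not how to schedule it
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo        # the program to run
inputs:
  message:               # illustrative input name
    type: string
    inputBinding:
      position: 1        # passed as the first positional argument
outputs: []              # this toy tool declares no output files
```

With the reference runner, something like `cwltool echo.cwl --message "hello"` should run it locally; the idea is that the same description can be handed unchanged to an engine sitting in front of a different scheduler or cloud.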
You may not have any of these needs, but suggesting that we're just bored developers creating castles in the sky is really unhelpful.