by 331c8c71
1305 days ago
Creating pipelines is still a problem. Typically you need to call a bunch of other tools to get to the final result. There may be map/reduce behavior in the middle, where chunks of data are processed in parallel to gain speed. You also need some kind of data management and tracking (putting samples in groups, ingesting raw data, exporting results), plus sane monitoring, especially when something breaks or fails. There are probably hundreds of tools written for this, but no clear winner so far. Traditional software engineering approaches like git and CI/CD seem too heavyweight (or rather too low-level), especially during development. IMHO there could be space for a fully remote/cloud solution, optimized for writing and maintaining pipelines, where one would code, debug, and deploy from the browser.

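To make the map/reduce step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `split_into_chunks`, `process_chunk`, and `run_pipeline` are hypothetical names, and the per-chunk work (summing squares) just stands in for whatever external tool a real pipeline would call on each chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Scatter: cut the input into consecutive chunks of at most chunk_size."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Map: hypothetical per-chunk step, a stand-in for calling an external tool."""
    return sum(x * x for x in chunk)


def run_pipeline(items, chunk_size=3, workers=4):
    """Process chunks in parallel, then reduce the partial results to one value."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks
        partials = list(pool.map(process_chunk, chunks))
    # Reduce: combine the partial results
    return sum(partials)


if __name__ == "__main__":
    print(run_pipeline(list(range(10))))
```

Threads are enough when each chunk shells out to another program; for CPU-bound Python work a process pool would probably be the better fit. None of this covers the tracking and monitoring parts, which is where the hard problems tend to be.
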
At one point we wrote an internal tool (I think lots of organizations do this, since the hundreds of existing tools somehow don't fit, so you invent #101). While it was tremendously satisfying to get batch jobs churning away on thousands of CPUs, that kind of data infrastructure needs to be standardized. I think some companies are doing this; e.g. I saw a presentation about Arvados/Curii that seemed interesting (but I haven't used it, so I'm not sure). Maybe CWL will turn out to be the way forward here?