Hacker News new | ask | show | jobs
by 331c8c71 1305 days ago
Creating pipelines is still a problem. Typically one needs to call a bunch of other tools in order to get to the final result. There could be map/reduce behavior in the middle where chunks of data are processed in parallel in order to gain speed. And you need some kind of data management/tracking as well (putting samples in groups, ingesting raw data, exporting results). And sane monitoring especially if something breaks/fails.

There are probably 100s of tools written for this but no clear winner so far. The traditional software engineering approaches like git, ci/cd seem too heavyweight (or rather too low-level) especially during development. IMHO there could be space for a fully remote/cloud solution where one would code/debug/deploy from the browser optimized for writing/maintaining pipelines.

1 comments

I also found the quality & proliferation of data pipeline tools to be baffling. Somehow always more painful to put these together than it seemed like it ought to be.

At one point we wrote an internal tool (I think lots of organizations do this, since all the 100s of existing tools somehow don't fit, so you invent #101) and while it was tremendously satisfying getting batch jobs with 1000's of cpu's churning away, that kind of data infrastructure needs to be standardized. I think some companies are doing this, e.g. saw a presentation about Arvados/Curii that seemed interesting (but haven't used it so not sure). Maybe CWL will turn out to be the way forward here?