by gsjbjt 2561 days ago
Why don’t data versioning tools like Git LFS do the job? Is it lack of awareness, or is the problem more complex than that?

Well, I could just be unaware of some functionality they have, but all those tools do is version things. There's no integration with reference managers like Zotero (that I'm aware of), and no tracking of interrelations or metadata.

In contrast, Zotero (and other reference managers) don't do any versioning at all (at least that I'm aware of). Instead, they keep track of the metadata that's necessary to put together a works cited section for an academic paper.

... or at least that's what they started out doing. These days they also try to organize your papers into some sort of category structure, facilitate tagging and notes, provide synchronization between your devices, and probably a few other things that don't come to mind right now.

Feature creep? Sure, but all that stuff is central to the research and writing process. It's also all tightly coupled, so splitting it between multiple tools doesn't work very well. And that's the current problem: how to integrate, for example, a few of your browser bookmarks with your academic literature collection. Or how to track a list of all the papers cited by a particular paper. Or how to link a specific paper tracked by your reference management software to a specific version of a large data set, perhaps itself tracked by Git LFS.

Generalizing a bit, what about linking experimental notes (typically pen and paper) with data collection software (typically a binary), the collected data (perhaps in Git LFS), and a specific version of some data analysis scripts you wrote (perhaps in Git)? Now try to track everything as you work on multiple paper revisions with collaborators, each of which adds (and sometimes removes) citations and could use a different (likely newer) revision of the collection software, data set, or analysis scripts.
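
To make the kind of linking I mean a bit more concrete, here's a rough Python sketch. It isn't taken from any existing tool, and every name in it (Artifact, PaperRevision, diff_revisions, the paths, hashes, and citation keys) is made up for illustration. The idea is just that each paper revision pins the exact version of everything it depended on, so you can ask what changed between two revisions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    """One pinned input: what it is, where it lives, and which version."""
    kind: str      # e.g. "dataset", "scripts", "notes", "collector"
    location: str  # hypothetical path or repository name
    version: str   # e.g. a Git commit hash, or a scan date for paper notes


@dataclass
class PaperRevision:
    """A single revision of a paper plus everything it depended on."""
    label: str
    citations: set[str] = field(default_factory=set)
    artifacts: list[Artifact] = field(default_factory=list)


def diff_revisions(old: PaperRevision, new: PaperRevision) -> dict[str, list]:
    """Report citation and artifact-version changes between two revisions."""
    old_pins = {(a.kind, a.location): a.version for a in old.artifacts}
    new_pins = {(a.kind, a.location): a.version for a in new.artifacts}
    # Only artifacts present in both revisions with a different version.
    changed = [
        (kind, old_pins[(kind, loc)], ver)
        for (kind, loc), ver in sorted(new_pins.items())
        if (kind, loc) in old_pins and old_pins[(kind, loc)] != ver
    ]
    return {
        "citations_added": sorted(new.citations - old.citations),
        "citations_removed": sorted(old.citations - new.citations),
        "artifacts_changed": changed,
    }


# Made-up example data: two drafts sharing a data set but not the scripts.
draft1 = PaperRevision(
    label="draft-1",
    citations={"smith2015", "lee2017"},
    artifacts=[
        Artifact("dataset", "data/run1", "a1b2c3"),
        Artifact("scripts", "analysis", "d4e5f6"),
    ],
)
draft2 = PaperRevision(
    label="draft-2",
    citations={"lee2017", "chen2018"},
    artifacts=[
        Artifact("dataset", "data/run1", "a1b2c3"),
        Artifact("scripts", "analysis", "0a9b8c"),
    ],
)
changes = diff_revisions(draft1, draft2)
```

Here `changes` reports one citation added, one removed, and the analysis scripts moving to a new commit. The hard part, of course, is not this data structure but getting the reference manager, Git, and Git LFS to populate it for you.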

Alternatively, for a data management scenario not directly involving writing papers, consider molecular cloning using plasmids. You have a dozen semi-related tubes in a cryogenic freezer that you need to track over many years (i.e. long-term inventory management). Each one has one or more pieces of sequencing data attached to it (so a small data set). They're all interrelated (you create a new one by physically modifying an old one), and each has the typical meta-links to experimental protocols, notes, academic literature, and other things.
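
As a rough illustration of what tracking that would involve, here's a minimal Python sketch. Again, this is not from any real inventory system: Plasmid, lineage, the field names, and the tube IDs are all hypothetical. Each tube records its freezer location, its sequencing files, its links, and the tube it was derived from, and walking the parent links recovers the lineage.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Plasmid:
    """One tube in the freezer, with its data and cross-links."""
    tube_id: str
    freezer_slot: str                  # hypothetical label, e.g. "box 3, A4"
    parent_id: str | None = None       # tube this one was made from, if any
    sequencing_files: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # protocols, notes, papers


def lineage(inventory: dict[str, Plasmid], tube_id: str) -> list[str]:
    """Walk parent links from a tube back to its original ancestor."""
    chain: list[str] = []
    seen: set[str] = set()
    current: str | None = tube_id
    while current is not None:
        if current in seen:
            # Guard against bad records that point back at themselves.
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        chain.append(current)
        current = inventory[current].parent_id
    return chain


# Made-up inventory: pC was made from pB, which was made from pA.
inventory = {
    "pA": Plasmid("pA", "box 1, A1"),
    "pB": Plasmid("pB", "box 1, A2", parent_id="pA"),
    "pC": Plasmid("pC", "box 1, A3", parent_id="pB",
                  sequencing_files=["pC_run1.ab1"]),
}
history = lineage(inventory, "pC")  # newest first, oldest ancestor last
```

A real system would also need the reverse query (everything derived from a given tube) and some way to keep the records in sync with what is physically in the freezer, which is exactly where pen and paper tends to fall down.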

I'm not aware of any software solutions that comprehensively address all of this stuff, so people still use pen and paper. But pen and paper is time-consuming, it's error-prone, it doesn't sync between devices, and it's slow and tedious to cross-reference. Those are all the typical problems that software is good at addressing.