|
|
|
|
|
by bastardoperator
1058 days ago
|
|
If you're storing 1TB of binary files in git, you're just doing it wrong anyways. You have a bunch of other tools and capabilities for doing this in a way that doesn't make your repository nightmarishly stupid to deal with because of its size. |
|
In most projects today, the code is (or generates, anyway) the data. This is true for materials science in physics, neural networks, and creation of databases via ETL. So, it would make sense to remove the requirement of making users of some software to regenerate this data, which may take 2 months on a supercomputer. Downloading that would be much faster. You can put it on a university server, or AWS, but now the data is in some system that is not guaranteed to be there. In fact, it's almost guaranteed to *not* be there in a very short period of time (people move positions and lose their access to these servers constantly).
So the very obvious best solution is IPFS for distribution of the data, but it does need to be linked to the git repo somehow. Of course, the data may not be simple or textual and play well with simple text based diffs for version control, so using something like borg can solve the issue of both data privacy, if needed, and block based diffs.
So this isn't to suggest "just git everything", but rather to say, 'if there's a new version control system for data and code, it's probably added some improvements to fit, and this could be a direction that makes sense'.
So I was checking to see if it had gone that direction yet.