by motoboi 1051 days ago
Why do you want such big files in a git repo?
2 comments

The point is to have an easy way to distribute code as data. This is important for many areas, such as training neural networks (code with fixed seeds can reproduce the weights output by training), various applications in basic physics, database creation via ETL, etc.

If the choice is between "run the code in this repo, wait 10 weeks while it runs, and retrieve the 50GB file" and "download this file", of course the latter is better. But many of these processes live in academia, where you are essentially guaranteed to eventually lose access to the server and whoever maintained that file for download, which gets pretty annoying. Additionally, there's no seamless way of distributing it (a link in the docs pointing somewhere else that may or may not still exist, etc.).

Since essentially all big data is really just code, it would make much more sense to join the two at the hip. A git commit hash that acts as a key directly to the IPFS data hash would fix this problem.
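To make the pairing concrete, here is a minimal sketch, not a real git or IPFS integration. It assumes a plain SHA-256 digest as a stand-in for the IPFS content hash (real IPFS CIDs are computed differently), and the manifest file and the `record_pairing` / `verify_pairing` names are made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-GB artifact never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_pairing(manifest_path, commit_hash, artifact_path):
    """Store commit hash -> artifact hash in a JSON manifest (hypothetical format)."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[commit_hash] = sha256_of_file(artifact_path)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest[commit_hash]


def verify_pairing(manifest_path, commit_hash, artifact_path):
    """Check that a downloaded artifact matches the hash recorded for this commit."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest.get(commit_hash) == sha256_of_file(artifact_path)
```

The commit hash would come from something like `git rev-parse HEAD`. Anyone who fetches the artifact from wherever it happens to be hosted can then call `verify_pairing` instead of rerunning the 10-week job.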

So it's not "wanting big files in a git repo" (an obvious no-no, since central servers shouldn't be used for storing large data, and centralized GitHub repos should only store files of single-digit MB or so). It's wanting to relieve the cost of rerunning processes that may require weeks of supercomputer time (QM calculations, etc.) by providing a guaranteed hash pairing between the code and its output.

How about why not? The only reason it's not done is that git doesn't support it.
Maybe I came across as accusatory, but I'm genuinely curious. Do you have 1 TB text files, or is this some kind of media management for video production, something like that?
Because it's a source control system, which means it's intended to store source code, not the artifacts generated from the source code. It seems far-fetched that anyone would manage to author 1 TB of source code.
This isn't true at all. We have been storing binary files separately via Maven for Java projects for almost 20 years now.

This was done with SVN projects. Keeping the blobs out of your source repos has been the preferred way for a long time.

[Edit] The only folks who seem to want to do this are game developers, and they are generally not people you would want to emulate.

Then how come git-lfs even exists at all? There's clearly a demand for it. Whether it's good practice is up for debate.

> Keeping the blobs out of your source repos has been the preferred way for a long time.

This is just appeal to tradition.

> This is just appeal to tradition.

It might be, but the argument was that we don't do it because of git.

We haven't been doing it for a long time, but that's not because of git.