
by chaxor 1051 days ago

The point is to have an easy way to distribute code as data. This matters in many areas: training neural networks (code with fixed seeds can reproduce the weights that training outputs), various applications in basic physics, database creation via ETL, etc.
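To make the seed point concrete, here is a minimal sketch. The `train_toy_weights` function is a hypothetical stand-in for a real training run, not anything from an actual framework; it only shows that a fixed seed makes the output, and therefore its hash, repeatable.

```python
import hashlib
import random
import struct


def train_toy_weights(seed: int, n: int = 8) -> bytes:
    """Hypothetical stand-in for a long training run.

    A seeded RNG produces the same 'weights' on every run.
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n)]
    # Pack as little-endian doubles so the byte layout is fixed.
    return struct.pack(f"<{n}d", *weights)


def digest(data: bytes) -> str:
    """Hex SHA-256 of the serialized output."""
    return hashlib.sha256(data).hexdigest()


same_a = digest(train_toy_weights(seed=42))
same_b = digest(train_toy_weights(seed=42))
other = digest(train_toy_weights(seed=43))

# Same code + same seed gives the same hash; a different seed does not.
print(same_a == same_b, same_a == other)
```

Real training is only this repeatable if everything else is pinned too (library versions, hardware, nondeterministic kernels), so treat this as the idealized case.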

If the choice is between "run the code in this repo, wait 10 weeks while it runs, and retrieve the 50GB file" and "download this file", then of course the latter is better. But many of these processes live in academia, where you are essentially guaranteed to eventually lose access to the server, and with it any maintenance of that download, which gets pretty annoying. Additionally, there's no seamless way of distributing the file (it's mentioned in the docs, which point somewhere else that may or may not still exist, etc).

Since essentially all big data is really just code, it would make much more sense to join the two at the hip. A git commit hash that serves as a key directly to the IPFS hash of the data would fix this problem.
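A rough sketch of what that pairing could look like, under loud assumptions: the manifest file and the function names are made up for illustration, and a plain SHA-256 hex digest stands in for the IPFS hash (real IPFS content identifiers are encoded differently and depend on how the file is chunked).

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a 50GB output never sits in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_pairing(manifest: Path, commit: str, output: Path) -> None:
    """Store commit hash -> output digest in a JSON manifest (hypothetical format)."""
    pairs = json.loads(manifest.read_text()) if manifest.exists() else {}
    pairs[commit] = sha256_file(output)
    manifest.write_text(json.dumps(pairs, indent=2, sort_keys=True))


def verify_download(manifest: Path, commit: str, downloaded: Path) -> bool:
    """Check that a downloaded file is the output recorded for this commit."""
    pairs = json.loads(manifest.read_text())
    return pairs.get(commit) == sha256_file(downloaded)
```

Whoever ran the 10-week job would call `record_pairing` once and commit the manifest; everyone else fetches the file from wherever it lives and calls `verify_download` instead of rerunning the code.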

So it's not "wanting big files in a git repo" (an obvious no-no, since central servers shouldn't be used for storing large data, and centralized GitHub repos should only store files in the single-digit MB range or so). It's wanting to relieve the cost of rerunning processes that may require weeks of supercomputer time (QM calculations, etc.) by providing a guaranteed hash pairing between the code and its output.