by chaxor 1047 days ago
Are you simply using it with GitHub repos?

It mentions that it can be used with backends like Dropbox, but it would be wonderful if we finally had a system that could easily be used with IPFS. This is especially important for large data, since you can't store 1TB on GitHub (and no, I don't count LFS, since you have to pay for it).

IPFS is the natural solution here, since everyone that wants to use the dataset has it locally anyway, and having thousands of sources to download from is better than just one.

So if this uses IPFS for the data repo, I'm switching immediately. If it doesn't, it's not worth looking into.

6 comments

If you're storing 1TB of binary files in git, you're just doing it wrong anyways. You have a bunch of other tools and capabilities for doing this in a way that doesn't make your repository nightmarishly stupid to deal with because of its size.
I didn't exactly intend it to operate precisely the same way that git does, but rather to have extensions of git that unify the system into one easy to use version control for data and code.

In most projects today, the code is (or generates, anyway) the data. This is true for materials science in physics, neural networks, and creation of databases via ETL. So, it would make sense to stop requiring users of some software to regenerate this data, which may take 2 months on a supercomputer. Downloading it would be much faster. You can put it on a university server, or AWS, but now the data is in some system that is not guaranteed to be there. In fact, it's almost guaranteed to *not* be there in a very short period of time (people move positions and lose their access to these servers constantly).

So the very obvious best solution is IPFS for distribution of the data, but it does need to be linked to the git repo somehow. Of course, the data may not be simple or textual and play well with simple text based diffs for version control, so using something like borg can solve the issue of both data privacy, if needed, and block based diffs.

So this isn't to suggest "just git everything", but rather to say, 'if there's a new version control system for data and code, it's probably added some improvements to fit, and this could be a direction that makes sense'.

So I was checking to see if it had gone that direction yet.

One can still use Subversion to store binary files in VCS...
Nexus, Artifactory, Packages (deb, rpm, nix), Cache, GitHub Releases... There are so many places you can grab a signed binary from that are just outright better for the health of your repo, and will respect your developers' time.
The issue is not limited to archives, artifacts, and packaging. Game projects, for example, have large directories with many binary assets which need to be change-controlled. Artifact repositories address distribution, in a way, but don't generally support change control much if at all.

(And yeah, git's historically a poor choice for this – so you may see companies sticking with Perforce or other non-distributed solutions.)

The `Backend` interface is not that wide: https://github.com/martinvonz/jj/blob/48b1a1c533f16fc5df5269.... Mostly it just handles reading/writing various objects. You could very plausibly add IPFS support yourself!
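
To give a feel for how small a "read/write various objects" interface can be, here is a rough sketch in Python. It is not jj's actual `Backend` trait: the names `ObjectBackend`, `InMemoryBackend`, `write_object`, and `read_object` are made up for illustration, and SHA-256 is just an assumed hash. An IPFS-backed version would presumably implement the same two methods against an IPFS node instead of a dict.

```python
import hashlib
from abc import ABC, abstractmethod


class ObjectBackend(ABC):
    """Hypothetical minimal backend: store and fetch immutable objects by hash."""

    @abstractmethod
    def write_object(self, data: bytes) -> str:
        """Persist data and return its content-derived id."""

    @abstractmethod
    def read_object(self, object_id: str) -> bytes:
        """Return the bytes previously stored under object_id."""


class InMemoryBackend(ObjectBackend):
    """Toy implementation that keeps objects in a dict (illustration only)."""

    def __init__(self) -> None:
        self._objects = {}

    def write_object(self, data: bytes) -> str:
        # The id is derived from the content, so identical data maps to one id.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def read_object(self, object_id: str) -> bytes:
        # Raises KeyError if the object was never written.
        return self._objects[object_id]
```

Because ids come from the content itself, swapping where the bytes live (local disk, a server, IPFS) should not change what the rest of the system sees.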
In the author’s presentation [0], the Google roadmap includes “custom working copy implementation for our internal distributed VFS”. The related graphic shows a “working copy” block connected to a “distributed file system block”.

This work might be extensible to include IPFS and other distributed virtual file systems.

[0] https://docs.google.com/presentation/d/1F8j9_UOOSGUN9MvHxPZX...

There are two questions in play here:

1. When we ingest files or make new commits, how are these additions to the object store persisted?

2. When operations modify the working copy, how should these changes be reflected in the user's view of that working copy?

Ordinary git handles (2) by directly modifying the files on the filesystem. If you `git checkout` a branch, git will `rm` files that don't exist in the target branch, `open()` and `write()` new ones, and adjust modification timestamps etc. as needed. As you make changes to these files, some commands will occasionally "notice" after the fact that a file changed, and some may choose to modify the index to match.
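
As a rough illustration of that "modify the files directly" approach, here is a toy checkout in Python. It is only a sketch under simplifying assumptions, not what git actually does: the target tree is a plain dict of relative path to contents, and timestamps, the index, file modes, and empty directories are all ignored. The function name `checkout` is just for illustration.

```python
from pathlib import Path


def checkout(workdir: Path, target: dict) -> None:
    """Make the files under workdir match target (relative path -> bytes)."""
    existing = {
        p.relative_to(workdir).as_posix()
        for p in workdir.rglob("*")
        if p.is_file()
    }
    # Remove files that are absent from the target tree.
    for rel in existing - set(target):
        (workdir / rel).unlink()
    # Create missing files and overwrite ones whose contents differ.
    for rel, data in target.items():
        path = workdir / rel
        if not path.exists() or path.read_bytes() != data:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
```

Files that already match are left untouched, which is roughly why a checkout between similar branches only rewrites a handful of files.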

The jj on GitHub also does this, but inside Google, our concept of "working copy" needs to be disconnected from local files. Developers don't have their own local "working copy" backed by files on the ordinary filesystem; instead, we do all development inside a FUSE-mounted virtual FS called "client in the cloud" (CitC), so working anywhere inside our giant monorepo doesn't take any disk space (except caching). I think that's what the "Distributed file system" refers to: instead of modifying the local filesystem, jj would need to talk to whatever remote service provides the user's FUSE-backed view whenever the user uses `jj checkout` or some other jj operation that modifies the working copy.

When you speak of implementing IPFS storage, I think instead you want to keep the object store and operation log on IPFS while keeping the local working copy right on the ordinary file system, similar to how git-LFS keeps local files untouched while modifying the way they're persisted to the git object store.

Alternatively, perhaps we could imagine an IPFS backend similar to `jj git` and `jj native`, perhaps `jj ipfs push/pull`. Then, a completely local repository could push/pull to and from IPFS, completely agnostic of how the user's repository is stored on disk.

In any case, Jujutsu's API surface is flexible enough to support any of these use cases, since the author designed it from the ground up to smoothly support very different needs for internal and external users. Most users outside Google just want a familiar working copy containing ordinary files, and the fact that the repository structure happens to be backed by a git-like object store (Linus' "git is just a merkle tree of files" philosophy) is incidental under the hood. That's just fine, even though most internal users will be interacting with a very different way of using jj when everything's said and done. Ideally, nobody needs to notice or care about the difference.

[1]: More about Google's internal VCS needs: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

[2]: Linus Torvalds on git: “In many ways you can just see git as a filesystem — it’s content addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.”

IPFS didn't always seem to have pinning services at BitTorrent prices ($3-$5 a month with 1TB bandwidth, no crypto wallet needed), and the client used tens to hundreds of KB per second idle.

Unfortunately, the model of putting every block in the DHT instead of having roots mode be the default, and then spamming your wantlist to tons of peers seems to still be at least partly around.

Right now IPFS looks pretty good thanks to the gateways and services, so I would imagine we'll see more of it in the future, but I can see why it took so long.

Yea, just GitHub repos. In fact just a single repo (work's mono repo) for now, but that's where I spend the majority of my day.
Why do you want such big files in a git repo?
The point is to have an easy way to distribute code as data. This is important for many areas, such as training neural networks (code with proper seeds can ensure the weights output by training), various applications in basic physics, database creation via ETL, etc.

If the choice is "run this code in the repo, wait 10 weeks while it's running, and retrieve the 50GB file" vs. "download this file", of course the latter is better. But many of these processes exist in academia, where you are essentially guaranteed to lose access to the server, and with it the maintenance of that file for download, so it can get pretty annoying. Additionally, there's no seamless way of distributing it (it's in the docs, which point somewhere else that may or may not exist, etc.).

Since essentially all big data is really just code, it would make much more sense to tie these directly at the hip. So, a git/repo commit hash that is a key directly to the IPFS data hash would fix this problem directly.
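
A minimal sketch of what such a pairing could look like, in Python. Everything here is an assumption for illustration: the manifest is a plain dict, the function names are made up, and a SHA-256 hex digest stands in for an IPFS content identifier (a real CID is encoded differently).

```python
import hashlib


def data_digest(data: bytes) -> str:
    # Stand-in for an IPFS content id; real CIDs use a different encoding.
    return hashlib.sha256(data).hexdigest()


def record_pairing(manifest: dict, commit_id: str, data: bytes) -> dict:
    """Return a copy of manifest that maps commit_id to the digest of its output data."""
    updated = dict(manifest)
    updated[commit_id] = data_digest(data)
    return updated


def verify(manifest: dict, commit_id: str, downloaded: bytes) -> bool:
    """Check that downloaded data matches what the manifest records for commit_id."""
    return manifest.get(commit_id) == data_digest(downloaded)
```

Someone who checks out a given commit could then look up the recorded digest, fetch the data from whoever has it, and confirm it is the output of that exact code without rerunning anything.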

So it's not "wanting big files in a git repo" (an obvious no-no, since central servers shouldn't be used for storing large data, and centralized GitHub repos should only store single-digit MB or so); it's wanting to relieve the cost of running processes that may require weeks of supercomputer time for QM calculations, etc., by providing a guaranteed hash pairing between the code and its output.

How about why not? The only reason it's not done is because git doesn't support it.
Maybe I came across as accusatory, but I'm genuinely curious. Do you have 1TB text files, or is this some kind of media management for video production, something like that?
Because it's a source control system, which means it's intended to store source code, not the artifacts generated from the source code. It seems far-fetched that anyone would manage to author 1 TB of source code.
This isn't true at all. We were storing binary files separately via Maven for Java projects for almost 20 years now.

This was done with SVN projects. Keeping the blobs out of your source repos has been the preferred way for a long time.

[Edit] The only folks who seem to want to do this are game developers, and they are generally not people you would want to emulate.

Then how come git-lfs even exists at all? There's clearly a demand for it. Whether it's good practice is up for debate.

> Keeping the blobs out of your source repos has been the preferred way for a long time.

This is just appeal to tradition.

> This is just appeal to tradition.

It might be, but the argument was that we don't do it because of git.

We haven't been doing it for a long time, but that's not because of git.