by kibwen 3330 days ago
Update:

  <wycats> Because of this: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomment-193772935
  <wycats> I ran some scenarios against huge repos when I first worked on cargo
  <wycats> Trying to minimize the cost of operations
  <wycats> I landed on the current strategy, and GitHub in the above thread more or less endorsed what we were already doing at that time
  <wycats> Also see https://github.com/rust-lang/cargo/issues/2452

It's still fundamentally a waste of disk space. On my system, as of a minute ago, ~/.cargo/registry/index took up about 200MB for three different checkouts (for some reason). After deleting that and running `cargo update`, only one of them is recreated, 104MB. Out of that, 57MB is the JSON files and 47MB is git history. But if I just concatenate all the JSON files, the result is only 33MB, and after gzipping, 3MB. Hypothetically, a non-GitHub-based Cargo could store only those 3MB (using binary deltas to avoid resending it on every update), or even 0MB if it just relied on the server to resolve dependencies.
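
For anyone who wants to reproduce that kind of measurement, here is a rough sketch of how the concatenated and gzipped sizes could be computed. The function name `index_sizes` is made up for illustration, and the directory you pass in is whatever checkout you want to measure; it skips `.git` so only the index files themselves are counted.

```python
import gzip
import os

def index_sizes(root):
    """Return (concatenated bytes, gzipped bytes) for all files under root."""
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata so only the index entries are counted,
        # and sort so the concatenation order is deterministic.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as f:
                chunks.append(f.read())
    blob = b"".join(chunks)
    return len(blob), len(gzip.compress(blob))
```

Note that this counts file contents only, so it will not match a `du`-style number that includes per-file filesystem overhead.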
Once you've gzipped to achieve that 3MB storage, binary deltas are useless. Perhaps the data could be (and almost certainly is) transferred gzipped, then expanded to the full 33MB so that binary diffs could be applied to it later, but setting up a system to do binary diffs is a lot of incidental complexity: xdelta is a surprisingly complex format, and bsdiff is really tuned for executables, not arbitrary content (and is pretty complex too).

It sounds like the biggest win would be for cargo to keep using git, but clone the crates.io index as a bare repository rather than checking out the plaintext content. Then it would only take 47MB by your count, which is pretty close to 33MB, and you could still get out the plain content with `git cat-file` and friends.
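
As a rough illustration of reading from a bare clone, something like the following would pull a single file out of the object store without a checkout. This is only a sketch: the helper names are hypothetical, and the `index.git` directory and the example path in the usage note are placeholders, not anything Cargo actually does.

```python
import subprocess

def cat_file_args(git_dir, path, rev="HEAD"):
    """Build the git command that prints one file from a bare repository."""
    return ["git", f"--git-dir={git_dir}", "cat-file", "-p", f"{rev}:{path}"]

def read_index_entry(git_dir, path, rev="HEAD"):
    """Return the file's bytes straight from the object store (no checkout)."""
    result = subprocess.run(cat_file_args(git_dir, path, rev),
                            check=True, capture_output=True)
    return result.stdout
```

With a bare clone at `index.git`, `read_index_entry("index.git", "se/rd/serde")` would return the raw contents of that file at `HEAD`, assuming the path exists in the repository.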

Technically, Cargo /already/ bundles a full copy of libxdelta as part of libgit2 (in addition to the separate Git binary delta algorithm); I just checked using nm that it's actually included in the binary. It could probably be removed, but, well, it probably adds a lot less than 44MB to the binary size :)

Alternately, since JSON is text, I suppose you could just ensure that whatever emits this hypothetical merged JSON file puts newlines between different packages' entries, and then use a regular text diff (on the uncompressed version, of course). But reading 44MB of JSON isn't instant; it would probably be better to switch to either a binary format, or even something silly like a sorted list of JSON strings separated by newlines.
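
To make the text-diff idea concrete, here is a minimal sketch using Python's `difflib`. The one-line-per-package layout, the field names, and the version numbers in the usage note are all assumptions made up for the example, not the real index format.

```python
import difflib
import json

def merged_index(packages):
    """Serialize {name: [versions]} as one JSON line per package, sorted by name."""
    return [json.dumps({"name": n, "vers": packages[n]}, sort_keys=True) + "\n"
            for n in sorted(packages)]

def text_diff(old_lines, new_lines):
    """Unified diff with zero context lines, so only changed package lines appear."""
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", n=0))
```

If one package gains a version, the diff contains just the old and new line for that package plus the hunk header, however large the rest of the file is. The Python standard library does not ship a patch applier, so the client side would still need one.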

There would be some incidental complexity around generating and applying the diffs… you'd probably want to precalculate them on the server side, but it could be rather expensive to, on every change, calculate a diff between the current version and every previous change. Instead, you could have daily checkpoints: each day the server would make a checkpoint and calculate a diff to the last N checkpoints; on every update the server would recalculate the diff between the latest checkpoint and HEAD. The client would store both HEAD and a reverse diff to the latest checkpoint (or just store the checkpoint separately and waste a few MB), so when it updates, it could revert to that checkpoint and request the diff from there to the new latest checkpoint; it would also request the diff from the checkpoint to the new HEAD. If its checkpoint is too old then it would just redownload from scratch.
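
Here is a toy sketch of that checkpoint scheme, mostly to check that the bookkeeping works out. Everything in it is hypothetical: the `Server` and `Client` classes, the dict-based "diff" format, and the `keep` parameter (the N above). It uses the simpler variant where the client stores the checkpoint separately, and it computes diffs on demand where a real server would precalculate them.

```python
def make_diff(old, new):
    """Diff between two {package: entry} snapshots: (removed names, changed entries)."""
    removed = [k for k in old if k not in new]
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    return removed, changed

def apply_diff(state, diff):
    """Return a new snapshot with the diff applied."""
    removed, changed = diff
    out = {k: v for k, v in state.items() if k not in removed}
    out.update(changed)
    return out

class Server:
    def __init__(self, keep=3):
        self.keep = keep            # N: how far back checkpoint diffs are offered
        self.head = {}
        self.checkpoints = [{}]     # list index doubles as checkpoint id

    def publish(self, name, entry):
        self.head = {**self.head, name: entry}

    def daily_checkpoint(self):
        self.checkpoints.append(dict(self.head))

    def update_for(self, client_cp):
        """Return (latest id, checkpoint diff, HEAD diff), or None if too old."""
        latest = len(self.checkpoints) - 1
        if client_cp is None or latest - client_cp > self.keep:
            return None
        cp = self.checkpoints[latest]
        return (latest,
                make_diff(self.checkpoints[client_cp], cp),
                make_diff(cp, self.head))

class Client:
    def __init__(self):
        self.cp_id = None
        self.checkpoint = {}
        self.head = {}

    def update(self, server):
        reply = server.update_for(self.cp_id)
        if reply is None:
            # First run, or our checkpoint is too old: redownload from scratch.
            self.cp_id = len(server.checkpoints) - 1
            self.checkpoint = dict(server.checkpoints[self.cp_id])
            self.head = dict(server.head)
            return "full"
        latest, cp_diff, head_diff = reply
        # Move our checkpoint forward, then rebuild HEAD on top of it.
        self.checkpoint = apply_diff(self.checkpoint, cp_diff)
        self.head = apply_diff(self.checkpoint, head_diff)
        self.cp_id = latest
        return "delta"
```

The client never diffs HEAD against HEAD; it always goes back to a checkpoint the server knows about, which is what keeps the number of diffs the server has to maintain bounded.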

Overall, not a trivial change, but probably not too hard either.

apt-get does something vaguely similar with its pdiff files.