Hacker News new | ask | show | jobs
by andrewchambers 4156 days ago
Isn't this how git works? I would have thought bit torrent did this all along.
3 comments

Not exactly. Git uses per-file, per-commit, and per-directory hashes. It does make up a sort of hash tree, but the tree does not descend within a given file. You need to hash the entire file and determine its hash to know if any of the pieces you have are valid. This would be a problem for large files if many of those pieces come from untrusted sources -- you'd have to spend a lot of time/bandwidth downloading invalid data before you realized it. This isn't generally the case for Git, but it for BitTorrent.

BitTorrent traditionally solves this using a hash list. All of the data in a torrent is broken up between pieces of a chosen size, and hashes are calculated for each of those pieces individually. (These are generally not aligned with file boundaries, which is why you may have noticed that even if you tell your torrent to only download a certain set of files, you may still end up with some data from adjacent files.)

This entirely hash list is included in the torrent file. For torrents with a lot of data, this could be 10MB or more.

If you're using a magnet link, that torrent file needs to be downloaded from peers. This brings back the original problem: you need to download this entire large file before you know that the peer isn't just sending you random data.

BEP-30 proposes a solution: generate a binary hash tree whose leaves are the torrent pieces, and include only the single root hash in the torrent file to keep it small. When you're getting pieces from a peer, they send you the missing inner hashes of the tree that you need to verify the piece.

The minimum data transfer in ideal circumstances is increased a bit, but the peer-to-peer system is made more robust, able to identify invalid data much faster.

I think it's a great modification to the protocol. Unfortunately it isn't widely-enough supported to be practical for general use.

>The minimum data transfer in ideal circumstances is increased by log(N)

There should be roughly as many internal nodes as there are leaves, so there is a linear space increase. As the leaves are much bigger than the internal nodes, the linear factor is small.

Yes it is, but it makes much more sense for git to use it: if a single byte is changed somewhere, you only need re-hashing the file and the trees up to the root; the other blobs/trees are left unchanged. It is very useful in this setting to reuse older existing hashes. In bittorrent, OTOH, we suppose every swarm will share different content, so there was no point making sure a hash would be reused somewhere.
git's smallest unit is a file - if your repostiory has one 3GB file, that's what is going to be hashed.

but bup[0] does so within a git repository to make incremental backup more efficient.

[0] https://github.com/bup/bup