| HN Mirror

Not exactly. Git uses per-file, per-commit, and per-directory hashes. It does make up a sort of hash tree, but the tree does not descend within a given file. You need to hash the entire file and determine its hash to know if any of the pieces you have are valid. This would be a problem for large files if many of those pieces come from untrusted sources -- you'd have to spend a lot of time/bandwidth downloading invalid data before you realized it. This isn't generally the case for Git, but it for BitTorrent.

BitTorrent traditionally solves this using a hash list. All of the data in a torrent is broken up between pieces of a chosen size, and hashes are calculated for each of those pieces individually. (These are generally not aligned with file boundaries, which is why you may have noticed that even if you tell your torrent to only download a certain set of files, you may still end up with some data from adjacent files.)

This entirely hash list is included in the torrent file. For torrents with a lot of data, this could be 10MB or more.

If you're using a magnet link, that torrent file needs to be downloaded from peers. This brings back the original problem: you need to download this entire large file before you know that the peer isn't just sending you random data.

BEP-30 proposes a solution: generate a binary hash tree whose leaves are the torrent pieces, and include only the single root hash in the torrent file to keep it small. When you're getting pieces from a peer, they send you the missing inner hashes of the tree that you need to verify the piece.

The minimum data transfer in ideal circumstances is increased a bit, but the peer-to-peer system is made more robust, able to identify invalid data much faster.

I think it's a great modification to the protocol. Unfortunately it isn't widely-enough supported to be practical for general use.