by implying 1932 days ago
Torrent seed files only contain hashes for "pieces" (typically 1-4 MB each), which can cross file boundaries. There's no way to pull a hash (MD5 or otherwise) for a specific file without actually snatching it.
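To make the boundary problem concrete, here is a minimal sketch of how files map onto pieces when all files are concatenated and then cut into fixed-size pieces. The function name `piece_ranges` and the example sizes are made up for illustration; only the concatenate-then-split layout comes from the comment above.

```python
from typing import List, Tuple

def piece_ranges(file_sizes: List[int], piece_length: int) -> List[Tuple[int, int]]:
    """Return (first_piece, last_piece) for each file, assuming the files
    are laid end to end and the resulting stream is cut into pieces."""
    ranges = []
    offset = 0  # byte offset of the current file within the concatenated stream
    for size in file_sizes:
        first = offset // piece_length
        # max(size, 1) keeps a zero-length file attached to the piece at its offset
        last = (offset + max(size, 1) - 1) // piece_length
        ranges.append((first, last))
        offset += size
    return ranges

# Hypothetical torrent: three files, 1 MiB pieces
sizes = [1_500_000, 700_000, 3_000_000]
for size, (first, last) in zip(sizes, piece_ranges(sizes, 1_048_576)):
    print(f"{size:>9} bytes -> pieces {first}..{last}")
```

In this example the second file shares its first piece with the first file and its last piece with the third, so none of the piece hashes covers that file alone.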
2 comments

BitTorrent v2 also includes a per-file hash tree: https://blog.libtorrent.org/2020/09/bittorrent-v2/

So I assume that if the same file ends up in multiple torrents, you'll be able to grab it from the seeds of those other torrents.

It would be interesting to see whether that standard replaces BitTorrent v1. For now it is not nearly as popular: everyone just uses v1, and I don't know of any site that has a significant number of v2 torrents.
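A rough sketch of why a per-file hash tree makes the same file recognizable across torrents: the root depends only on that file's bytes, not on its neighbours. This is a simplified Merkle tree in the spirit of v2 (SHA-256 over 16 KiB blocks, zero-hash padding up to a power of two); it is not claimed to be byte-for-byte compatible with the actual spec, and `file_root` is a name invented here.

```python
import hashlib

BLOCK = 16 * 1024  # leaf block size assumed for this sketch

def file_root(data: bytes) -> bytes:
    """Simplified per-file Merkle root: hash each block, pad the leaf
    layer with all-zero hashes to a power of two, then pair upwards."""
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)]
    if not leaves:  # empty file: use a single leaf so the loops below still work
        leaves = [hashlib.sha256(b"").digest()]
    while len(leaves) & (len(leaves) - 1):  # true until the count is a power of two
        leaves.append(bytes(32))
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]

# The same bytes give the same root, whichever torrent they are listed in
song = b"\x01" * 50_000
print(file_root(song) == file_root(bytes(song)))
print(file_root(song) == file_root(song + b"\x00"))
```

With v1 piece hashes the equivalent comparison is not possible, because the hash of a boundary piece also depends on whatever file sits next to it.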
Pull the files, derive the hashes, and then throw away the files? You only need to retrieve each file once to derive and store the per-file hashes.
I thought about that; it is technically possible. More than that, it is relatively cheap (incoming traffic on most VPSes is free). Still, I decided that searching by file size is much more cost-effective. It is almost as good as searching by hash for larger files. It requires little to no code and time to implement. It works even for torrents without seeds. It doesn't involve downloading files (which, in some people's opinion, is illegal as such, even if you never actually look at the downloaded files).
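For what it's worth, the size-based lookup can be sketched in a few lines. Everything here is hypothetical (the `catalogue` contents, the function names); it only assumes you already have each torrent's file list with exact byte sizes, which the metadata provides without downloading anything.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical catalogue: torrent name -> list of (path, size in bytes)
Catalogue = Dict[str, List[Tuple[str, int]]]

def build_size_index(catalogue: Catalogue) -> Dict[int, List[Tuple[str, str]]]:
    """Map each exact file size to every (torrent, path) that has it."""
    index = defaultdict(list)
    for torrent, files in catalogue.items():
        for path, size in files:
            index[size].append((torrent, path))
    return index

def same_size_elsewhere(index, torrent: str, size: int) -> List[Tuple[str, str]]:
    """Candidate copies of a file: same byte size, different torrent."""
    return [(t, p) for t, p in index.get(size, []) if t != torrent]

catalogue = {
    "album-flac": [("01.flac", 31_457_280), ("cover.jpg", 204_800)],
    "album-flac-repack": [("disc1/01.flac", 31_457_280)],
    "unrelated": [("readme.txt", 1_024)],
}
index = build_size_index(catalogue)
print(same_size_elsewhere(index, "album-flac", 31_457_280))
```

A match on size is only a candidate, not proof of identical content, which is why it works best for larger files where an accidental collision on the exact byte count is less likely.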