by jandrese
1840 days ago
Millions of individual torrents is not a great solution. Keeping them all seeded is basically impossible unless they run a seed for each one, at which point they might as well just host the files. Plus you'll never get the economy of scale that makes BitTorrent really shine. When you have a whole lot of tiny files, and people will generally only want one or two of them, there isn't much better than a plain old website. A torrent that hosts all of the papers could be useful for people who want to make sure the data can't be lost to a single police raid.
With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.
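To make the hashsum idea concrete, here is a rough sketch of a checksum manifest for a directory of files, using only Python's standard `hashlib`. The function names (`file_digest`, `build_manifest`, `verify`) and the choice of SHA-256 as the default are my own assumptions for illustration, not anything an existing archive is known to use.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest, algorithm="sha256"):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p, algorithm) != expected:
            bad.append(rel)
    return bad
```

Anyone holding a copy of the manifest can run `verify` against their own mirror and see which files are missing or altered, regardless of where the files came from.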
I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. N-gram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several is made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) remains a stumbling point.
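As a sketch of what such a fingerprint could look like: normalize the text to absorb case, punctuation and whitespace noise, form 5-word tuples, hash each one, and keep the `k` smallest hashes. Picking the smallest hashes is a min-hash-style trick I'm swapping in so that two parties select the same tuples without coordinating; it is one possible convention, not an agreed standard. All names and parameters here (`normalize`, `fingerprint`, `n=5`, `k=8`) are assumptions, and this does nothing clever about OCR substitutions like "rn" vs "m".

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and split on whitespace."""
    text = unicodedata.normalize("NFKD", text).lower()
    # Delete everything except ASCII letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()


def ngrams(words, n=5):
    """All consecutive n-word tuples, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(text, n=5, k=8):
    """The k smallest truncated SHA-256 hashes over the text's n-grams."""
    grams = {" ".join(g) for g in ngrams(normalize(text), n)}
    hashes = sorted(
        hashlib.sha256(g.encode("utf-8")).hexdigest()[:16] for g in grams
    )
    return set(hashes[:k])


def similarity(a, b):
    """Crude overlap score between two fingerprints, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two copies of the same text that differ only in case, punctuation or spacing should produce identical fingerprints, while unrelated texts should share essentially none. Texts shorter than `n` words yield an empty fingerprint, so very short works would need a different scheme.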