by stavros
960 days ago
Since you're pretty knowledgeable about these things, I think I should ask here: I've made a fairly simple design for a program based on BitTorrent that will allow people to "donate" their disk space to organizations like archive.org, Anna's Archive, and anything else that needs data hosted. Basically, you download a client, say "allocate 2 TB of my disks to whatever archive.org/donate/disk.rss says", and the server/client combination ensures you download and seed the rarest 2 TB of the collection. This design is also open, in the sense that the server can share the database of torrents it contains, and anyone can use it to fetch any of the files in the dataset from the swarm. Would something like this be at all useful? I've emailed a few archivists but got no response, and the one person I've managed to talk to about this said there have been a few attempts at this, but they always fail for one reason or another.
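To make the "rarest 2 TB" idea concrete, here is a minimal sketch of how a client might choose what to seed. It assumes the server publishes a size and a seeder count for each torrent; the `Torrent` record, the `pick_rarest` function, and the greedy strategy are all hypothetical illustrations, not part of the design described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Torrent:
    infohash: str
    size_bytes: int
    seeders: int  # assumed to be reported by the coordinating server


def pick_rarest(torrents, budget_bytes):
    """Greedily pick the least-seeded torrents that fit in the donated space."""
    chosen = []
    remaining = budget_bytes
    # Fewest seeders first; ties go to the smaller torrent so more items fit.
    for t in sorted(torrents, key=lambda t: (t.seeders, t.size_bytes)):
        if t.size_bytes <= remaining:
            chosen.append(t)
            remaining -= t.size_bytes
    return chosen


# Toy catalog with made-up sizes and seeder counts.
catalog = [
    Torrent("aaa", 700, 1),
    Torrent("bbb", 900, 0),
    Torrent("ccc", 500, 5),
    Torrent("ddd", 400, 1),
]
selection = pick_rarest(catalog, 2000)
```

A greedy pass like this is not an optimal packing, but it is simple and keeps the least-seeded items at the front of the queue.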
The hard part is that those who donate their space have authority over that space. It’s the Byzantine fault tolerance problem: imagine if 4chan donated their space, then started serving CSAM instead of the expected data. You can use hashes to verify integrity, but then the question becomes who gets to decide which hashes are ok. And hashing makes it impossible to edit large files in place: any change produces a different hash, so every revision is effectively a new file. That’s a frequent occurrence in LLM work, where you’re constantly tweaking your datasets and spitting out new blobs.
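As a rough illustration of the integrity check, the sketch below verifies a downloaded blob against a manifest of expected SHA-256 digests. The manifest format and the names `sha256_hex` and `verify_blob` are assumptions made for this example, and it deliberately leaves open the question of who publishes or signs the manifest.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_blob(name: str, data: bytes, manifest: dict) -> bool:
    """Accept a blob only if its digest matches the manifest entry for that name."""
    expected = manifest.get(name)
    return expected is not None and sha256_hex(data) == expected


# Hypothetical manifest: file name -> digest chosen by whoever holds authority.
original = b"dataset v1"
manifest = {"books.bin": sha256_hex(original)}

ok = verify_blob("books.bin", original, manifest)
tampered = verify_blob("books.bin", b"something else", manifest)
# An honest edit fails the same way: the manifest needs a new entry for it.
edited = verify_blob("books.bin", b"dataset v2", manifest)
```

Note that a malicious substitution and a legitimate edit are indistinguishable at this layer, which is why the authority over the manifest matters so much.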
Direct answer: yes, you’re doing good work, and you should keep doing it. I would personally use this for storing books3 transformations.
The other hard part is that you’ll want at least some redundancy: see MIT’s 6.824 distributed systems course, or the GFS paper. It’s why I’ve been implementing Raft and toying with some kind of distributed consensus without a blockchain. (Such consensus is still possible if the researchers are granted authority over what can be stored, which is the whole reason people are donating their disk space in the first place.)
Another issue is sudden bandwidth loss. Data storage is one half of it. The other half is rapid transfer. By replicating the data, you can pull it from multiple replicas at once (i.e. there are more seeders). This also protects against someone suddenly getting throttled, or just having a power outage. The protocol should prioritize donors with high bandwidth over donors with vast storage space.
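One way to picture pulling from several replicas at once is to split the pieces of a file across the replicas that are currently reachable, in proportion to their measured bandwidth. This is only a sketch under that assumption; `plan_download` and the integer bandwidth figures are hypothetical, and real BitTorrent clients schedule pieces dynamically rather than planning up front.

```python
def plan_download(num_pieces: int, replicas: dict) -> dict:
    """Split pieces across reachable replicas in proportion to bandwidth.

    `replicas` maps a replica name to its measured bandwidth as an integer
    in any consistent unit; 0 means the replica is offline or throttled out.
    """
    online = {name: bw for name, bw in replicas.items() if bw > 0}
    if not online:
        raise RuntimeError("no replica available")
    total = sum(online.values())
    # Fastest replica first, so it also absorbs any rounding leftovers.
    ordered = sorted(online.items(), key=lambda kv: kv[1], reverse=True)
    plan = {name: num_pieces * bw // total for name, bw in ordered}
    plan[ordered[0][0]] += num_pieces - sum(plan.values())
    return plan


# One replica has dropped out; the remaining two share the load.
plan = plan_download(100, {"fast": 60, "slow": 20, "offline": 0})
```

With more replicas in the pool, losing any single one just shifts its share onto the others instead of stalling the transfer.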
Feel free to DM me on Twitter if you’d like to toss around some design ideas more seriously, and thank you for trying to build this.