| You are literally building what I’ve been slowly working towards on my own. This seems like a very good sign. Multiple simultaneous discovery is a common occurrence in the sciences. The hard part is that those who donate their space have authority over that space. It’s the Byzantine fault tolerance problem: imagine if 4chan donated their space, then started serving CSAM instead of the expected data. You can use hashes to verify integrity, but then the question becomes who gets to decide which hashes are ok. And content hashing makes it impractical to edit large files in place, since any change produces a new hash, and edits are a frequent occurrence in LLM work. You’re constantly tweaking your datasets and spitting out new blobs. Direct answer: yes, you’re doing good work, and you should keep doing it. I would personally use this for storing books3 transformations. The other hard part is that you’ll want at least some redundancy; see 6.824 distributed systems, or the GFS paper. It’s why I’ve been implementing Raft and toying with some kind of distributed consensus without a blockchain. (Such consensus is still possible if the researchers are granted authority over what can be stored, which is the whole reason people are donating their disk space in the first place.) Another issue is sudden bandwidth loss. Data storage is one part of it. The other half is rapid transfer. By replicating the data, you can pull it from multiple replicas at once (i.e. there are more seeders). This also protects against someone suddenly getting throttled, or just having a power outage. The protocol should prioritize donors with high bandwidth over those with vast storage space. Feel free to DM me on Twitter if you’d like to toss around some design ideas more seriously, and thank you for trying to build this. |
If you're doing this in BitTorrent, you might want a client that's configured to optimize for different goals than most torrent clients.
Potential goals, somewhat conflicting:
A) Keep data with a low mirroring degree available. Either this needs to be centrally coordinated, or it needs some sort of randomized algorithm where clients pick underseeded torrents but don't all pick the same ones.
B) Bandwidth matching. To avoid consuming more resources than it provides, a client should perhaps download only 1 piece of data for every N pieces it has uploaded. This is much less greedy than a normal torrent client, but it ensures that caches themselves don't take up much bandwidth compared to users who actually want to download something. Otherwise a misconfigured cache (e.g. one behind NAT) could accidentally keep downloading data without ever giving much back.
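To make the two goals a bit more concrete, here is a rough Python sketch, not tied to any real torrent client. Everything in it is an assumption for illustration: `pick_underseeded` takes a hypothetical map of torrent id to observed seeder count and samples with weight 1 / (seeders + 1), which is just one possible randomized scheme for goal A; `RatioGate` is a hypothetical counter for goal B, and its `bootstrap` allowance (one free piece so a fresh cache has something to seed) is my own addition.

```python
import random
from dataclasses import dataclass


def pick_underseeded(seed_counts, k, rng=None):
    """Goal A: pick up to k distinct torrents, favouring poorly seeded ones.

    seed_counts is a hypothetical {torrent_id: seeder_count} map.
    """
    if rng is None:
        rng = random.Random()
    # Weight 1 / (seeders + 1): rare torrents are favoured, but the choice
    # stays random so caches do not all converge on the same torrent.
    pool = {t: 1.0 / (n + 1) for t, n in seed_counts.items()}
    chosen = []
    while pool and len(chosen) < k:
        ids = list(pool)
        pick = rng.choices(ids, weights=[pool[t] for t in ids], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


@dataclass
class RatioGate:
    """Goal B: allow one piece download per n pieces uploaded."""
    n: int
    uploaded: int = 0
    downloaded: int = 0
    bootstrap: int = 1  # assumed free pieces so a new cache can start seeding

    def record_upload(self):
        self.uploaded += 1

    def may_download(self):
        allowed = self.bootstrap + self.uploaded // self.n
        return self.downloaded < allowed

    def record_download(self):
        if not self.may_download():
            raise RuntimeError("download budget exhausted")
        self.downloaded += 1
```

A cache would call `may_download()` before requesting each piece. A cache stuck behind NAT that never manages to upload would then stall after its bootstrap pieces instead of leeching indefinitely.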