|
> No serious person designing a filesystem today would say it's okay to misplace your data.

Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.

When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to get my ISO illegally, so I found a torrent of the ISO, and wrote a Python script to download the ISO from Microsoft, but use the torrent file to verify chunks and re-download any corrupted chunks. Something was very wrong on some device between my desktop and Microsoft's servers, but it eventually got a non-corrupted ISO.

It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.

Edit: side note: it's interesting to see the number of complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers. Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet.
I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.

(Though, really, all read-write filesystems should be copy-on-write log-based, at least for recent writes, maybe having some background process using a count-min-sketch to estimate locality for frequently read data and optimize read locality for rarely changing data that's also frequently read.)

Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (to intentionally use a different CRC polynomial than Ethernet, to catch more error patterns) to end-to-end checksum data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to raise flow control, corruption/loss detection, retransmission (and add forward error correction) in libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP using rateless forward error codes.
Provided that everyone is implementing responsible (read TCP-compatible) flow control, it's really good to have the rapid evolution possible by just using UDP and raising other concerns to libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so for long-fat networks, you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.) |
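For anyone curious what the "verify chunks against the torrent and re-download the bad ones" trick in the comment above might look like, here is a minimal sketch. It is not the original script. It assumes the per-piece SHA-1 digests and the piece length have already been parsed out of the torrent file (BitTorrent v1 metadata stores one 20-byte SHA-1 digest per fixed-size piece), and `fetch_range` is a hypothetical callback standing in for something like an HTTP range request to the official server.

```python
import hashlib
from typing import Callable, List


def piece_digests(data: bytes, piece_length: int) -> List[bytes]:
    """SHA-1 digest of each fixed-size piece (the last piece may be shorter)."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def corrupted_pieces(data: bytes, piece_length: int,
                     expected: List[bytes]) -> List[int]:
    """Indexes of pieces whose digest differs from the expected digest."""
    actual = piece_digests(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


def repair(data: bytes, piece_length: int, expected: List[bytes],
           fetch_range: Callable[[int, int], bytes],
           max_rounds: int = 5) -> bytes:
    """Re-fetch only the pieces that fail verification.

    fetch_range(start, length) is a hypothetical callback and is assumed
    to return exactly `length` bytes starting at offset `start`.
    """
    buf = bytearray(data)
    for _ in range(max_rounds):
        bad = corrupted_pieces(bytes(buf), piece_length, expected)
        if not bad:
            return bytes(buf)
        for index in bad:
            start = index * piece_length
            length = min(piece_length, len(buf) - start)
            buf[start:start + length] = fetch_range(start, length)
    # Check once more after the final round of re-fetching.
    if corrupted_pieces(bytes(buf), piece_length, expected):
        raise IOError("pieces still corrupt after %d rounds" % max_rounds)
    return bytes(buf)
```

The retry loop matters because, as in the story above, a re-fetched piece can come back corrupted again, so each piece is re-verified rather than trusted after one retry.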
It's not my field, but my impression is that it would be equally resilient to just randomise the start block (adjust spacing of start blocks according to user bandwidth?) then let users just run through the download serially; maybe stopping when they hit blocks that have multiple sources and then skipping to a new start block?
It's kinda mind-boggling to me to think of all the processes that go into a 'simple' torrent download at the logical level.
If AIs get good enough before I die then asking them to create simulations of silly things like this will probably keep me happy for all my spare time!
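As a starting point for that kind of simulation, here is a rough sketch of the two chunk orderings discussed in this thread: "rarest chunks first, randomized among equally rare chunks" and "random start block, then run through serially". This is not LimeWire's code. The function names and the representation of peers as sets of chunk indexes are assumptions made for illustration only.

```python
import random
from collections import Counter
from typing import Iterable, List, Set


def rarest_first_order(missing: Iterable[int], peer_have: List[Set[int]],
                       rng: random.Random) -> List[int]:
    """Order missing chunks by how few peers have them.

    Ties are broken randomly: shuffle first, then rely on the stable
    sort to keep that random order among equally rare chunks.
    Chunks that no peer has are left out, since they cannot be fetched.
    """
    counts = Counter()
    for have in peer_have:
        counts.update(have)
    available = [c for c in sorted(missing) if counts[c] > 0]
    rng.shuffle(available)
    available.sort(key=lambda c: counts[c])
    return available


def random_start_sequential_order(num_chunks: int,
                                  rng: random.Random) -> List[int]:
    """Pick a random start chunk, then download serially, wrapping around."""
    start = rng.randrange(num_chunks)
    return [(start + i) % num_chunks for i in range(num_chunks)]


if __name__ == "__main__":
    rng = random.Random(42)
    peers = [{0, 1, 2}, {0, 1}, {0, 3}]
    print(rarest_first_order(range(5), peers, rng))
    print(random_start_sequential_order(8, rng))
```

The second ordering keeps disk writes mostly sequential, which is the seek-reduction argument made above. The first spreads rare chunks through the swarm faster at the cost of more random writes.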