by jeffbee 1540 days ago
Didn't you answer your own question above? It's firmware bugs. The disk reported a successful write at block X but it actually wrote the data to block Y. Later you read block Y and you get the data meant for X. The block-level ECC is consistent, so nothing flags it. There is also a low but nonzero probability that you request a read of block X and are served some other block, again with matching checksums. And of course there's always the possibility that your firmware simply has a bug in its ECC checker.
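
To make that concrete, here is a minimal sketch of why a per-block checksum stays consistent after a misdirected write, and how mixing the block address into the checksum would catch it. Everything in it is an assumption for illustration: the `seal` and `verify` helpers are hypothetical, CRC32 stands in for whatever code a real drive uses, and a dict stands in for the disk.

```python
import zlib

def _crc(lba: int, payload: bytes, bind_address: bool) -> bytes:
    # Hypothetical helper: CRC32 over the payload, optionally prefixed
    # with the logical block address so the checksum is tied to a location.
    seed = lba.to_bytes(8, "little") if bind_address else b""
    return zlib.crc32(seed + payload).to_bytes(4, "little")

def seal(lba: int, payload: bytes, bind_address: bool) -> bytes:
    """Return the payload followed by its 4-byte checksum trailer."""
    return payload + _crc(lba, payload, bind_address)

def verify(lba: int, stored: bytes, bind_address: bool) -> bool:
    """Check the trailer of a stored block against the address being read."""
    payload, trailer = stored[:-4], stored[-4:]
    return _crc(lba, payload, bind_address) == trailer

def misdirected_write_detected(bind_address: bool) -> bool:
    disk = {}                      # toy disk: block address -> stored bytes
    block_x, block_y = 100, 200    # assumed addresses, purely illustrative
    # The host asks for a write to block X; buggy firmware puts it at block Y.
    disk[block_y] = seal(block_x, b"data for X", bind_address)
    # A later read of block Y verifies against the address actually read.
    return not verify(block_y, disk[block_y], bind_address)

if __name__ == "__main__":
    print("payload-only checksum detects it:", misdirected_write_detected(False))
    print("address-bound checksum detects it:", misdirected_write_detected(True))
```

Even with the address bound in, block X still holds its old contents with a checksum that is valid for X, so the stale data there goes unnoticed by this check.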

The paper "Parity Lost and Parity Regained" assigns a probability of 1.88e−5 to misdirected-write bugs among disks, so if you have a warehouse full of disks you now have this nightmare.
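
A rough back-of-the-envelope sketch of the warehouse arithmetic, under loud assumptions: the 1.88e−5 figure is treated as an independent per-disk probability (the comment does not say per what unit or time period), and the fleet size of 100,000 is made up for illustration.

```python
def p_any_affected(p_per_disk: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks is affected."""
    return 1.0 - (1.0 - p_per_disk) ** n_disks

def expected_affected(p_per_disk: float, n_disks: int) -> float:
    """Expected number of affected disks in the fleet."""
    return p_per_disk * n_disks

if __name__ == "__main__":
    p = 1.88e-5        # figure quoted above; its unit is an assumption here
    fleet = 100_000    # hypothetical warehouse size
    print("expected affected disks:", expected_affected(p, fleet))
    print("P(at least one affected):", p_any_affected(p, fleet))
```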


Fun question: what if a relocation table gets corrupted? And what protection is there against that possibility? You can bet it's not the same ECC as on data blocks. The rest is left as an exercise for the reader. ;)

Uh oh, I recognize this one. Love to have a file corrupt after months at rest with no access logged and no mtime changes because a file on a neighboring track needed rewriting (SMR, of course).