| HN Mirror



	by jeffbee 1540 days ago
	Didn't you answer your own question above? It's firmware bugs. The disk reports a successful write to block X but actually writes the data to block Y. Later you read block Y and get X's data, and the block-level ECC is consistent, so nothing flags the error. There is also a low but nonzero probability that you request a read of block X and are served some other block, again with a matching checksum. And of course there's always the possibility that the firmware simply has a bug in its ECC-checking code. The paper "Parity Lost and Parity Regained" assigns a probability of 1.88e−5 to misdirected-write bugs among disks, so if you have a warehouse full of disks you now have this nightmare.
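To make the failure mode concrete, here is a toy sketch, not how any real drive or filesystem works. `ToyDisk`, `seal`, `unseal`, and the `misdirect_to` parameter are all made up for illustration, and CRC-32 stands in for the drive's real ECC. It shows a misdirected write passing the per-block check, and one possible mitigation: a checksum that also covers the intended block address.

```python
import zlib


class ToyDisk:
    """Hypothetical disk: each slot holds (data, crc), mimicking per-sector ECC."""

    def __init__(self, nblocks):
        self.slots = [(b"", zlib.crc32(b""))] * nblocks

    def write(self, lba, data, misdirect_to=None):
        # misdirect_to simulates the firmware bug: the caller is told that
        # `lba` was written, but the data actually lands somewhere else.
        target = lba if misdirect_to is None else misdirect_to
        self.slots[target] = (data, zlib.crc32(data))

    def read(self, lba):
        data, crc = self.slots[lba]
        if zlib.crc32(data) != crc:
            raise IOError("sector ECC mismatch")
        return data


def seal(lba, payload):
    """Prefix the payload with a CRC over (intended address + payload)."""
    tag = zlib.crc32(lba.to_bytes(8, "little") + payload)
    return tag.to_bytes(4, "little") + payload


def unseal(lba, block):
    """Verify that the block was really meant for `lba`, then return the payload."""
    tag, payload = int.from_bytes(block[:4], "little"), block[4:]
    if zlib.crc32(lba.to_bytes(8, "little") + payload) != tag:
        raise ValueError(f"block read from {lba} was not written for {lba}")
    return payload


disk = ToyDisk(8)
X, Y = 3, 5
disk.write(X, seal(X, b"old X"))
disk.write(Y, seal(Y, b"old Y"))
disk.write(X, seal(X, b"new X"), misdirect_to=Y)  # reported as a write to X

raw = disk.read(Y)  # the per-block check passes: the data is self-consistent
try:
    unseal(Y, raw)
    detected = False
except ValueError:
    detected = True  # the address-aware checksum notices the mismatch
```

Note that this only catches the damage at Y. Block X still holds the stale "old X" with a checksum that verifies, so detecting the lost half of the write would need something stored away from the data itself, such as a version number or a checksum kept in a parent block.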


notacoward 1540 days ago

Fun question: what if a relocation table gets corrupted? And what protection is there against that possibility? You can bet it's not the same ECC as on data blocks. The rest is left as an exercise for the reader. ;)


jonah-archive 1540 days ago

uh oh, I recognize this one. Love to have a file corrupt after months at rest with no access logged and no mtime changes because a file on a neighboring track needed rewriting (SMR, of course).
