by notacoward
1540 days ago
> How did you know it was a change at rest?
Because we had the checks for it in flight. Also, more often than not these same blocks had been checked before, and found to be fine.
> The only correct way to test for bitrot is to read the data back immediately
No, the only correct way is to read it back after some time has passed. Mis-written data is not the same as bitrot.
> must be corrected or it must not be returned
Every error-correction technique has a limit to how many simultaneous errors it can correct. Beyond that, bits can be flipped in a way that seems valid but in fact is not (detectable by cross-checking with other erasure-coded fragments of the same block on other machines). Just because you haven't seen it doesn't mean it doesn't happen. As I said, and as others have said many times, with sufficient scale and time even the most unlikely scenarios become almost inevitable.
Why do you persist in telling me I didn't see what I saw with my own eyes? Are you assuming that my thirty years in storage gave me less understanding or insight regarding these issues than whatever experience (if any) you have?
|
> No, the only correct way is to read it back after some time has passed. Mis-written data is not the same as bitrot.
Well, no. If you want to check for at-rest bitrot, you first need to make sure that you've written out the correct thing. Otherwise it's not possible to tell at-rest corruption from corruption that happened on the way in.
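To make the distinction concrete, here is a minimal sketch of the two-step check: verify right after the write, then verify again later. The names (`checksum`, `classify`, `flip_bit`) are made up for illustration, the three byte strings stand in for real device reads, and it assumes the immediate read-back actually hits the medium rather than a cache.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content hash computed before the data is handed to the write path."""
    return hashlib.sha256(data).hexdigest()


def classify(original: bytes, read_after_write: bytes, read_later: bytes) -> str:
    """Attribute a mismatch to the write path or to time spent at rest.

    original         -- what the application intended to write
    read_after_write -- what came back immediately after the write
    read_later       -- what came back on a later scrub
    """
    expected = checksum(original)
    if checksum(read_after_write) != expected:
        # Bad from the start: the later read says nothing about bitrot.
        return "write-path corruption"
    if checksum(read_later) != expected:
        # Verified good once, bad now: it changed while stored.
        return "at-rest corruption"
    return "ok"


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a single flipped bit (hypothetical fault injection)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


block = b"some block contents"
print(classify(block, block, block))
print(classify(block, flip_bit(block, 3), flip_bit(block, 3)))
print(classify(block, block, flip_bit(block, 3)))
```

Without the first check, the second and third cases look identical from the outside.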
> Every error-correction technique has a limit to how many simultaneous errors it can correct.
But it can detect the case when it can't recover. Which is why it will either produce a correct output or an error.
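As a toy illustration of "correct output or an error", here is a sketch of an extended Hamming (8,4) SECDED code: it corrects any single flipped bit and reports any double flip as uncorrectable. This is not what a disk sector uses (real sector ECC is far stronger), and `encode`, `decode`, and `UncorrectableError` are names invented for this example. The guarantee only holds within the code's design limit; the last case shows three flips being silently "corrected" into wrong data.

```python
class UncorrectableError(Exception):
    """Raised when the decoder detects an error it cannot correct."""


def encode(data):
    """Encode 4 data bits as Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]  # positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(codeword):
    """Return the 4 data bits, or raise if a double-bit error is detected."""
    word = list(codeword[:7])
    # Syndrome: XOR of the 1-based positions of all set bits.
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos
    # Parity over all 8 bits: 0 for a valid codeword.
    overall = 0
    for bit in codeword:
        overall ^= bit
    if syndrome and not overall:
        # Nonzero syndrome with even overall parity: two bits flipped.
        raise UncorrectableError("double-bit error detected")
    if syndrome:
        # Odd overall parity: treat as a single flip at position `syndrome`.
        word[syndrome - 1] ^= 1
    # syndrome == 0 with odd parity means only the parity bit flipped.
    return [word[2], word[4], word[5], word[6]]


def flip(codeword, *indices):
    """Return a copy with the given 0-based bit indices flipped."""
    out = list(codeword)
    for i in indices:
        out[i] ^= 1
    return out


data = [1, 0, 1, 1]
cw = encode(data)
print(decode(flip(cw, 4)) == data)        # one flip: corrected
try:
    decode(flip(cw, 0, 1))                # two flips: reported
except UncorrectableError as exc:
    print("error:", exc)
print(decode(flip(cw, 0, 1, 3)) == data)  # three flips: past the design limit
```

How often a real drive ends up past its own limit is exactly the point under dispute here; the sketch only shows that the limit exists for any fixed-strength code.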
> As I said, and as others have said many times, with sufficient scale and time even the most unlikely scenarios become almost inevitable.
This is not an argument if it goes against how things actually work.
> Why do you persist in telling me I didn't see what I saw with my own eyes?
I am merely curious about your exact testing technique, because undetected at-rest bitrot is vanishingly unlikely, even at the exabyte scale. For it to happen, the data and its ECC (7-11% of the data size) both need to be corrupted in a coordinated way. That is exceedingly unlikely, especially given the academic papers that found that on-disk corruption is nearly always clustered and is either small-scale or a full-sector failure.
So when you say you ran into a lot of these cases, it's only natural to ask for details. And "scale" is not a detail.
> Are you assuming that my thirty years in storage gave me less understanding or insight regarding these issues than whatever experience (if any) you have?
I have no way to tell. But given your experience, can you explain how at-rest bitrot, should it occur, can seep through the on-disk error correction? I am not talking about RAID-style setups, just the plain ECC record in a disk sector [1].
[1] https://en.wikipedia.org/wiki/Advanced_Format#Overview (linking to Advanced Format, because it has a diagram)