by kiwijamo 754 days ago
I had a Team Group SSD that would very occasionally report a successful write, only for the same data to fail to read several weeks or months later. Eventually it got to the point where some blocks just wouldn't read at all (or just read corrupted data). I ended up getting an RMA replacement.

On the replacement drive I used the badblocks utility to run a read/write test, to make sure every sector on the SSD read back correctly after being written.

Probably not best practice but how do I check the SSD is fine in the first place, especially a blank SSD? My issue is that reading a blank SSD is likely to work just fine, as presumably if there is no written data yet the SSD controller can short-circuit and just return a zero-filled response. This means the underlying media isn't tested at all, if I am understanding it correctly.

The first SSD I got seemed fine (even SMART kept parroting that everything was fine, even though some of the more detailed SMART data were showing worrying trends) and it was only when I noticed that some files on the NTFS partition were not reading correctly that I started to suspect disk failure. At first a file would read "fine" but with corrupted data; over time the reads would start to simply hang and then fail.

Luckily I had md5 sums of some of these files and was able to confirm that several files had become corrupted at some point between when they were written (and the md5sum computed) and several weeks later, which is how I ended up running badblocks on the first drive to confirm the defect. I wish I had used ZFS and not NTFS.

1 comment

> Probably not best practice but how do I check the SSD is fine in the first place, especially a blank SSD?

The only sure way to check the memory cells is to overwrite the whole drive, flush all disk caches (power cycle the system), read back all the written bytes, and check that the values read are the same as the values that were written. This could be accomplished e.g. by setting up block-level encryption on the whole drive (e.g. on Linux, LUKS), writing zeroes to the open (decrypted) volume, and after a power cycle, opening (decrypting) the volume again and checking that all bytes read are zero. Because what actually reaches the drive is ciphertext rather than zeroes, the controller should not be able to treat it as empty space and skip the media.
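The write-then-verify procedure above can be sketched in Python. This is only a rough illustration, and it swaps the LUKS trick for a different technique: a seeded pseudo-random pattern derived with SHAKE-256, which plays the same role of putting non-trivial, reproducible data on the drive. The names `pattern_block`, `write_pattern` and `verify_pattern`, the 1 MiB block size and the seed are all assumptions made for the example.

```python
import hashlib
import os

BLOCK_SIZE = 1 << 20  # 1 MiB per block; an arbitrary choice for this sketch


def pattern_block(seed: bytes, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Reproducible pseudo-random bytes for block number `index`."""
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(size)


def write_pattern(path, total_bytes, seed, block_size=BLOCK_SIZE):
    """Overwrite the first `total_bytes` of `path` with the pattern.

    WARNING: this destroys whatever is stored at `path`.
    """
    n_blocks = (total_bytes + block_size - 1) // block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    with os.fdopen(fd, "wb") as f:
        for index in range(n_blocks):
            size = min(block_size, total_bytes - index * block_size)
            f.write(pattern_block(seed, index, size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_pattern(path, total_bytes, seed, block_size=BLOCK_SIZE):
    """Return the indices of blocks that fail to read or do not match."""
    n_blocks = (total_bytes + block_size - 1) // block_size
    bad = []
    with open(path, "rb") as f:
        for index in range(n_blocks):
            offset = index * block_size
            size = min(block_size, total_bytes - offset)
            try:
                f.seek(offset)
                data = f.read(size)
            except OSError:
                data = None  # a read error counts as a bad block
            if data != pattern_block(seed, index, size):
                bad.append(index)
    return bad
```

The intended use would be to call `write_pattern` on the drive, power cycle the machine, and only then call `verify_pattern` with the same seed and size. Verifying straight after writing may be answered from the operating system's page cache or the drive's own cache, which is why the power cycle matters. An empty list from `verify_pattern` means every block read back as written.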

A simpler, less reliable, but still worthwhile test would be to do the same, except that instead of checking the read values, you just throw them away (e.g. on Linux, by redirecting to /dev/null). The disk firmware should still try to read all the sectors, and if it is not lying too much, show read problems/reallocated sectors in the SMART data.
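A minimal Python sketch of that read-and-discard pass, for anyone who would rather script it than redirect to /dev/null. The function name `read_and_discard` and the 1 MiB block size are assumptions for illustration; the only thing it reports is which offsets raised a read error at the OS level.

```python
import os


def read_and_discard(path, block_size=1 << 20):
    """Read every block of `path`, throw the data away, note read errors.

    Returns (bytes_read, failed_offsets).
    """
    bytes_read = 0
    failed_offsets = []
    with open(path, "rb") as f:
        # Seeking to the end gives the size of a regular file, and on
        # Linux also the size of a block device.
        size = f.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                f.seek(offset)
                chunk = f.read(want)
                bytes_read += len(chunk)  # contents are deliberately ignored
            except OSError:
                failed_offsets.append(offset)
            offset += want
    return bytes_read, failed_offsets
```

A clean run here only shows that the drive returned something for every block, not that the data was correct, so the SMART attributes are still worth comparing before and after the pass (for example with `smartctl -a` from smartmontools).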