by prirun 39 days ago
My sister has a Windows 10 laptop she used for her accounting business. One day it decided not to boot, saying there was no boot device. I took the laptop home, took the SSD out (Samsung 1TB), put it in an external USB case, plugged it into another Windows laptop, and it showed up in Explorer. Weird.

I had another brand-new, identical Samsung SSD, so I hooked both the old and new drives up to a Linux laptop (with USB cases) and tried to dd the old drive to the new drive. That mostly worked, but VERY VERY slowly: it would run fast for 5 seconds and then have no activity for 30 seconds. I had a fan blowing on the old drive to keep it cool because it was running very hot.

The dd copy would eventually fail and then I'd restart it with appropriate iseek and oseek values. I also did a cmp /dev/zero with the new disk to verify that it was all zeroes (it was brand new), and that allowed me to use conv=sparse on the dd. The reason for that was to avoid writing to every sector of the new disk; I didn't want to copy sectors from the old drive that had never been accessed (she only used about 250GB of the 1TB).
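
For anyone wanting to try the restart-with-offsets approach, here is a rough sketch on scratch files rather than real disks. It assumes GNU dd; the file names, the 1 MiB block size, and the "failure after 3 blocks" are made up for the demo. It uses skip= and seek=, the traditional spellings that newer GNU dd also accepts as iseek= and oseek=.

```shell
# Scratch-file demo; on real hardware SRC and DST would be block devices.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
SRC="$work/old.img"
DST="$work/new.img"
BS=1048576   # 1 MiB blocks (arbitrary choice for the demo)

# 8 MiB source: 2 MiB of data, the rest zeros (stand-in for unused sectors).
dd if=/dev/urandom of="$SRC" bs="$BS" count=2 status=none
truncate -s $((8 * BS)) "$SRC"

# All-zero destination of the same size, standing in for the blank new SSD.
truncate -s $((8 * BS)) "$DST"

# First run "dies" after 3 blocks (simulated here with count=3).
# conv=sparse seeks over all-zero blocks instead of writing them.
dd if="$SRC" of="$DST" bs="$BS" count=3 conv=sparse,notrunc status=none

# Resume: skip the blocks already read, seek past the blocks already written.
DONE=3
dd if="$SRC" of="$DST" bs="$BS" skip="$DONE" seek="$DONE" \
   conv=sparse,notrunc status=none

cmp "$SRC" "$DST" && echo "copies match"
```

Note that conv=sparse is only safe here because the destination already reads as zeros wherever dd skips a write, which is what the cmp against /dev/zero established for the real drive.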

It took a couple of days and about 5 restarts to finish the copy, but it did work, and as a precaution, I made another copy of the drive and ran a cmp of the original drive and the 2nd copy (also having to restart cmp several times). Since that compare worked, I knew that all 3 drives had identical content. The new drive worked fine in her laptop and she was mighty glad to see her Windows login screen.

The thing that made this work, IMO, is that Linux has a longer timeout for errors than Windows apparently does, especially during the boot sequence. Plus Linux allows adjusting the drive timeout, so if the device is doing error recovery, which is sometimes slow, it gives it time to finish rather than reporting an error.
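
A sketch of one way to adjust that timeout: for drives that show up as SCSI disks (sdX, which includes SATA and USB enclosures), the kernel exposes the per-command timeout in seconds under sysfs. The function name, the SYSFS override, and the value 180 are my own choices for illustration, not anything from the original recovery.

```shell
# Show the current command timeout for a disk, then raise it.
# SYSFS defaults to /sys; it is overridable only so the function can be
# tried against a scratch directory instead of a real device.
set_disk_timeout() {
    dev=$1
    secs=$2
    f="${SYSFS:-/sys}/block/$dev/device/timeout"
    [ -w "$f" ] || {
        echo "cannot write $f (wrong device name, or not root?)" >&2
        return 1
    }
    echo "old timeout: $(cat "$f")s"
    echo "$secs" > "$f"
    echo "new timeout: $(cat "$f")s"
}

# Example (as root), with sdX standing in for the failing drive:
#   set_disk_timeout sdX 180
```

The setting is not persistent: it goes back to the default when the drive is replugged or the machine reboots.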

One of my theories was that the bad SSD was overheating, but if that was the case, a cold boot should have worked, with the failures only coming later.

The other theory is that one of the chips on the SSD failed, so the drive was having to use the ECC codes to correct for the missing information, and the correction process was taking longer than Windows boot would tolerate.

1 comment

Next time you have a disk where you need to do repeated dd runs over different ranges, or suspect that you might need to, use ddrescue. It tracks which sectors have been recovered (and has lots of useful options).
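
A minimal sketch of what that looks like, run on scratch files so it is safe to try (it does nothing if GNU ddrescue is not installed). The file names are made up; the options used (-q, -n, -r) are standard ddrescue options, but check the manual for your version.

```shell
# Scratch-file demo; on real hardware SRC and DST would be the two drives.
set -eu
if command -v ddrescue >/dev/null 2>&1; then
    work=$(mktemp -d)
    trap 'rm -rf "$work"' EXIT
    SRC="$work/old.img"
    DST="$work/new.img"
    MAP="$work/rescue.map"

    dd if=/dev/urandom of="$SRC" bs=1M count=4 status=none

    # First pass: grab everything that reads cleanly and skip the slow
    # scraping phase (-n). Progress is recorded in the mapfile.
    ddrescue -q -n "$SRC" "$DST" "$MAP"

    # Running again with the same mapfile picks up where the last run
    # stopped; areas already marked finished are not read again.
    # -r3 retries the remaining bad areas up to three times.
    ddrescue -q -r3 "$SRC" "$DST" "$MAP"

    cmp "$SRC" "$DST" && echo "image matches source"
else
    echo "ddrescue not found; install GNU ddrescue to try this"
fi
```

On real drives the same two commands would take the device names (placeholders: /dev/sdX for the failing drive, /dev/sdY for the new one) plus -f, which ddrescue requires before it will overwrite an existing device. The mapfile replaces the hand-computed iseek/oseek values after each failure.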

You can also get 'partclone' to generate a list (in ddrescue format) of sectors containing data, so you don't need to try to read unused areas of the disk. For the partclone trick to work, the FS does need to be at least somewhat readable.
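
Roughly, and hedged because I am going from the man pages rather than a tested run: partclone's -D (--domain) option writes the used-block list as a ddrescue domain file, and ddrescue's -m (--domain-mapfile) option restricts the rescue to those areas. The device names below are placeholders, and the run wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Prints the commands by default; set DRY_RUN=0 to really run them (as root).
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = 1 ] || "$@"
}

OLD=/dev/sdX1   # placeholder: failing NTFS partition
NEW=/dev/sdY1   # placeholder: same-size partition on the new drive

# 1. Have partclone list the blocks the filesystem actually uses and write
#    them out as a ddrescue domain file (-D) instead of cloning anything.
run partclone.ntfs -s "$OLD" -D -o used.domain

# 2. Restrict ddrescue to that domain (-m); rescue.map still tracks progress
#    so the run can be interrupted and resumed.
run ddrescue -f -m used.domain "$OLD" "$NEW" rescue.map
```

This sketch rescues partition to partition so the domain offsets line up. For a whole-disk copy the positions would need shifting by the partition's start offset; as I understand it, partclone's --offset_domain option exists for that.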