by jandrese 205 days ago
I don't know how true this is, but it seems to me that SSD firmware has to be more complex than HDD firmware, and I've seen far more SSDs die due to firmware failure than HDDs. I've seen HDDs with corrupt firmware (junk strings and nonsense values in the SMART data, for example), but usually the drive still reads and writes data. In contrast, I've had multiple SSDs, often with relatively low power-on hours, just suddenly die with no warning. Some of them even show up as a completely different (and totally useless) device on the bus. Drives with SandForce controllers used to do this all the time, which was a problem because SandForce hardware was apparently quite affordable and many third-party drives used their chips.

I have had a few drives go completely read-only on me, which is always a surprise to the underlying OS when it happens. What is interesting is that you can't predict when a drive might go read-only on you. I've had a system drive that was only a couple of years old and running on a lightly loaded system claim to have exhausted its write endurance and go read-only, although to be fair that drive was a throwaway Inland brand one I got almost for free at Micro Center.

If you really want to see this happen, try running a Raspberry Pi or similar SBC off a microSD card and leave it running for a couple of years. There is a reason people who are actually serious about those kinds of setups go to great lengths to put the logging on a ramdisk and shut off as much stuff as possible that might touch the disk.
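
For a rough sense of why constant logging matters, here is a back-of-the-envelope sketch. Every number passed in is a made-up assumption for illustration, not a spec for any real card, and the model (capacity times rated erase cycles, divided by write amplification) ignores retention, controller quality and everything else that kills cards in practice.

```python
def years_until_worn_out(capacity_gb, pe_cycles, host_writes_gb_per_day,
                         write_amplification):
    """Very rough flash lifetime estimate in years.

    All parameters are hypothetical inputs:
      capacity_gb            - usable flash capacity
      pe_cycles              - program/erase cycles each block is assumed to survive
      host_writes_gb_per_day - what the OS writes per day (logs, swap, ...)
      write_amplification    - flash writes per host write (small random
                               writes such as log appends push this up)
    """
    # Total host data the flash can absorb before the blocks are worn out.
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    days = total_host_writes_gb / host_writes_gb_per_day
    return days / 365


# Hypothetical 32 GB card, 1000 cycles, 5 GB/day of writes, amplification of 4.
estimate = years_until_worn_out(32, 1000, 5, 4)
print(f"{estimate:.1f} years")
```

Moving the logs to a ramdisk shrinks `host_writes_gb_per_day`, which is the only term in this model that the person setting up the board really controls.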

3 comments

I worked on SSD firmware for more than a decade, from the early days of SLC memory to TLC memory. SLC memory was so rock solid that you hardly needed any ECC protection. You could go months of use without any errors. And the most common error was an erase error, which just means you no longer use that block.
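
To make the "no longer use that block" part concrete, here is a minimal sketch of bad-block retirement. The names (`BlockPool`, `EraseError`, `nand_erase`) are hypothetical and not taken from any real firmware; the point is only that a failed erase moves the block to a bad list instead of back to the free list.

```python
class EraseError(Exception):
    """Raised by the (hypothetical) NAND layer when a block fails to erase."""


class BlockPool:
    def __init__(self):
        self.free = []    # erased blocks that are ready to be written
        self.bad = set()  # blocks retired after an erase failure

    def recycle(self, block, nand_erase):
        """Erase a used block and return it to the free list.

        If the erase fails, retire the block so it is never handed out again.
        Returns True when the block is reusable, False when it was retired.
        """
        try:
            nand_erase(block)
        except EraseError:
            self.bad.add(block)
            return False
        self.free.append(block)
        return True


# Usage with a fake NAND layer in which block 3 has gone bad.
def fake_erase(block):
    if block == 3:
        raise EraseError(f"block {block} failed to erase")


pool = BlockPool()
results = [pool.recycle(b, fake_erase) for b in (1, 3, 7)]
```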

But then as the years progressed, the transistors were made smaller and MLC and TLC were introduced, all to increase capacity, but it made the NAND worse in every other way: endurance, retention, write/erase performance, read disturb. It also made the algorithms and error recovery process more complicated.

Another difficult thing is recovering the FTL mapping tables from a sudden power loss. Having those power loss protection capacitors makes it so much more robust in every way. I wish more consumer drives included them. It probably just adds $2-3 to the product cost.
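
As an illustration of why this recovery is painful, here is a simplified sketch of one way to rebuild the mapping after an unclean shutdown. It assumes each written page carries metadata with its logical address and a monotonically increasing sequence number. `PageMeta` and `rebuild_mapping` are hypothetical names, and real firmware would combine something like this with checkpoints rather than scanning the whole drive.

```python
from typing import Dict, Iterable, NamedTuple


class PageMeta(NamedTuple):
    physical_page: int  # where the page lives on flash
    lba: int            # logical address recorded alongside the data
    seq: int            # write sequence number; higher means newer


def rebuild_mapping(scanned: Iterable[PageMeta]) -> Dict[int, int]:
    """Rebuild the logical-to-physical table from scanned page metadata.

    When several physical pages claim the same LBA, the one with the
    highest sequence number is the live copy and the rest are stale.
    """
    newest: Dict[int, PageMeta] = {}
    for meta in scanned:
        current = newest.get(meta.lba)
        if current is None or meta.seq > current.seq:
            newest[meta.lba] = meta
    return {lba: meta.physical_page for lba, meta in newest.items()}


# LBA 10 was written twice; the later write (seq 3) wins.
pages = [PageMeta(0, 10, 1), PageMeta(1, 11, 2), PageMeta(2, 10, 3)]
table = rebuild_mapping(pages)
```

With power loss protection, the drive gets enough time to flush the in-RAM table, so it does not have to guess which of those copies was the last one fully written.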

That's kind of what ZNS is for: make the SSD dumb but in exchange predictable; let the database on top, which already uses some type of CoW structure, handle the quantization of erase blocks; expose all overprovisioning from the start and just give back less usable capacity after an erase block fails to erase, and skip over any read-access-sized blocks that got killed off when mapping logical addresses to physical ones. That has to exist anyway, because for yield reasons some percentage of blocks is expected to be dead from the factory.
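
The "skip over dead blocks" mapping can be sketched in a few lines. This is only an illustration of the idea under my own assumptions, not how any ZNS drive implements it; `build_block_map` and `logical_to_physical` are made-up names.

```python
def build_block_map(total_blocks, dead_blocks):
    """Return the physical blocks still usable, in order.

    The index into the returned list is the logical block number, so
    usable capacity is simply len(block_map) and shrinks as blocks die.
    """
    dead = set(dead_blocks)
    return [b for b in range(total_blocks) if b not in dead]


def logical_to_physical(block_map, logical_block):
    """Translate a logical block number, skipping over dead physical blocks."""
    if not 0 <= logical_block < len(block_map):
        raise IndexError("logical block is beyond the usable capacity")
    return block_map[logical_block]


# 8 physical blocks, 2 of them dead from the factory (blocks 2 and 5).
block_map = build_block_map(8, [2, 5])
physical = logical_to_physical(block_map, 2)  # logical 2 lands on physical 3
```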

> it seems to me that SSD firmware has to be more complex than HDD firmware

I think they're complicated in different ways. A hard disk drive has to power an electromagnet in a motor that moves an arm, read the magnetic state of the part of the platter under the read head, and correlate that to something? Oh, and there are multiple read heads. Seems ridiculously complex!

Yet somehow firmware bugs are endemic on SSDs far more than they were on HDDs.