by HarryHirsch 3172 days ago
Flash memory has three operations: read, write, and erase, the last two destructive. If you pretend SSDs are hard disks with only two operations, read and write, you go through all sorts of contortions. Sometimes you fall flat on your face, as seen here.
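
To make the read/write/erase asymmetry concrete, here is a minimal sketch of a single flash erase block. The class name `FlashBlock`, the four-page size, and the use of `None` for an erased page are all assumptions made for illustration; real NAND geometry and error behavior differ.

```python
class FlashBlock:
    """Toy model of one NAND erase block (sizes are illustrative only)."""

    def __init__(self, pages=4):
        # None marks an erased page that can still be programmed.
        self.pages = [None] * pages
        self.erase_count = 0

    def read(self, page):
        # Reads do not change the stored data in this toy model.
        return self.pages[page]

    def program(self, page, data):
        # A programmed page cannot be overwritten in place.
        if self.pages[page] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[page] = data

    def erase(self):
        # Erase works on the whole block, never on a single page.
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


block = FlashBlock()
block.program(0, b"v1")
try:
    block.program(0, b"v2")  # in-place overwrite, as a hard disk would allow
    overwrite_ok = True
except ValueError:
    overwrite_ok = False
```

The failed overwrite is the contortion in question: to change page 0, everything else in the block has to be moved elsewhere or erased along with it.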

Why don't operating systems treat SSDs as flash memory, and why doesn't the file system cooperate with the underlying hardware instead of pretending it's a disk? For home use that may even work, but in a demanding environment the extra complexity will invariably fail.

This is a genuine question, I'm an amateur here.

3 comments

Speaking as someone who used to work on SSD firmware, here are some rambling thoughts... Yes, moving some of the FTL to the OS will help a lot in reducing the complexity for the SSD developer, but the problems are just moved up a level. The OS will probably still have to use a copy-on-write (COW) scheme aware of the block and page size restrictions of the underlying flash. And you can't do a raw disk copy without accounting for defective blocks. Maybe the SSD will still handle basic ECC protection and data scrambling, but the OS will now have to handle read disturb, wear leveling, defect management, and data recovery using signal processing. But many of these characteristics will change from one NAND technology to another, so someone will have to characterize and update the algorithms. I would actually say it is this last bit that really trips up SSD firmware design. Otherwise you would think that after an iteration or two of firmware we would have a solid design, but the flash technology tends to bring up some new requirements with each node, which introduces more complexity.

There is some work on Open-Channel SSDs, which move most of the flash translation layer (FTL) to the host system. There are two major problems with this approach:

1. Each OS that wants to use the drive needs a compatible implementation of the FTL. Consumer systems always have at least two operating systems in play (UEFI counts for these purposes). Enterprise systems are where you will actually find non-boot data-only drives.

2. Flash memory changes. The FTL needs very different parameters depending on whether you're using Toshiba flash or Samsung flash, and even depending on whether you're using last year's Toshiba flash or the stuff they're manufacturing today.

These aren't insurmountable problems, but they're enough to keep such products confined to a small niche. Instead, we're seeing a trend of SSDs accepting optional hints that allow them to perform the kinds of optimizations you'd expect from a fully host-managed SSD. The ATA TRIM command was just the tip of this iceberg.
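
As a rough illustration of why a hint like TRIM helps, the sketch below counts how many pages a drive would have to relocate before erasing one block, with and without knowing that the filesystem deleted some of the data. The function `pages_to_copy` and the LBA numbers are made up for this example; it is a simplified model, not how any particular drive implements garbage collection.

```python
def pages_to_copy(block_lbas, live_lbas):
    """Count the pages garbage collection must relocate before erasing a block.

    block_lbas: logical addresses stored in the block being reclaimed.
    live_lbas:  logical addresses the drive believes still hold valid data.
    """
    return sum(1 for lba in block_lbas if lba in live_lbas)


# One erase block currently holds logical blocks 10..13.
block_lbas = [10, 11, 12, 13]

# The filesystem deleted a file that occupied LBAs 12 and 13.
deleted_by_fs = {12, 13}

# Without TRIM the drive never hears about the deletion, so every page
# still looks valid and has to be copied.
drive_view_without_trim = {10, 11, 12, 13}

# With TRIM the host tells the drive those LBAs no longer matter.
drive_view_with_trim = drive_view_without_trim - deleted_by_fs

copies_without_trim = pages_to_copy(block_lbas, drive_view_without_trim)
copies_with_trim = pages_to_copy(block_lbas, drive_view_with_trim)
```

In this toy case the hint halves the copy work for the block, which is the kind of saving a fully host-managed design would otherwise get from knowing the filesystem's state directly.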

Could you provide more details on these hints? Are they ioctl calls, assuming one is using the disk as a raw block device without a filesystem?

I was referring to extensions to the command set the OS uses to interact with the drive itself. Some of these are quite like a madvise() call, but at a lower layer. Others permit the drive to expose a bit more information to the OS so that it can better optimize its IO patterns. I summarized the most recently standardized changes at [1], but there are several other features in the NVMe spec [2] that fall into this category. The extension for IO determinism has been approved for the next standard but the official spec for it hasn't been published. (I'm referring here mostly to NVMe stuff, but there are SCSI/SAS analogs to many of these features.)

[1] https://www.anandtech.com/show/11436/nvme-13-specification-p...

[2] http://www.nvmexpress.org/resources/specifications/

> Why don't operating systems treat SSDs as flash memory, and why doesn't the file system cooperate with the underlying hardware instead of pretending it's a disk? For home use that may even work, but in a demanding environment the extra complexity will invariably fail.

The simple reason is that SSDs themselves expose a regular HD interface and then do a lot of the flash-memory related stuff internally. For example, without TRIM support (which early SSDs did not have) there is no 'erase' command the OS can send to an SSD.

With that in mind, SSDs also have memory controllers on them that map the blocks the OS sees to actual SSD blocks (scattered across the memory chips). So when the OS writes to block 1 it may write to block 15 internally on the SSD, and then block 2 might write to block 4002. Combine this with caching and other various details on the SSD side, and it leaves little predictable behavior for the OS to exploit.
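
The remapping described above can be sketched as a toy translation table. This is only an illustration under simple assumptions: the class `ToyFTL`, its eight-page size, and the append-only allocation policy are invented here, and a real controller adds caching, garbage collection, and wear leveling on top.

```python
class ToyFTL:
    """Toy flash translation layer: logical blocks map to physical pages."""

    def __init__(self, physical_pages=8):
        self.flash = [None] * physical_pages  # physical page contents
        self.l2p = {}                         # logical block -> physical page
        self.stale = set()                    # physical pages holding old data
        self.next_free = 0                    # next unwritten physical page

    def write(self, lba, data):
        if self.next_free >= len(self.flash):
            # A real FTL would garbage-collect stale pages here.
            raise RuntimeError("out of free pages")
        old = self.l2p.get(lba)
        if old is not None:
            # The old copy cannot be overwritten in place; mark it stale.
            self.stale.add(old)
        self.flash[self.next_free] = data
        self.l2p[lba] = self.next_free
        self.next_free += 1

    def read(self, lba):
        phys = self.l2p.get(lba)
        return None if phys is None else self.flash[phys]


ftl = ToyFTL()
ftl.write(1, b"a")
ftl.write(2, b"b")
ftl.write(1, b"c")  # rewriting logical block 1 lands on a new physical page
```

After the three writes, logical block 1 lives on a different physical page than it started on, and the OS has no way to see that from the block interface.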