by personjerry 1830 days ago
How big is the write cache usually, and how does it work? The write caches I've seen are typically something like 32MB, but the "top speed" seems to be sustained for files much bigger than 32MB, which doesn't make sense to me if that top speed supposedly comes from writing to the cache. How does that work?
3 comments

Getting full throughput from the SSD is less about file size and more about how much work is in the SSD's queue at any given moment. If the host system only issues commands one at a time (as would often result from using synchronous IO APIs), then the SSD will experience some idle time between finishing one command and receiving the next from the host system. If the host ensures there are 2+ commands in the SSD's queue, it won't have that idle time.
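The idle-time argument above can be put into a toy model. This is only a sketch under loud assumptions: the SSD is treated as a single unit that takes a fixed `service_s` per command, and the host needs a fixed `host_gap_s` to notice a completion and submit the next command. Both numbers are made up for illustration, not measured from any drive.

```python
def commands_per_second(queue_depth, service_s, host_gap_s):
    """Deterministic closed-loop model with `queue_depth` commands in flight.

    Assumption: one device that handles commands back to back, each taking
    service_s seconds, and a host that takes host_gap_s seconds to turn a
    completion into a new submission.
    """
    device_limit = 1.0 / service_s                     # device never idle
    host_limit = queue_depth / (service_s + host_gap_s)  # host can't keep up
    return min(device_limit, host_limit)


def device_utilization(queue_depth, service_s, host_gap_s):
    """Fraction of time the device spends working rather than waiting."""
    return commands_per_second(queue_depth, service_s, host_gap_s) * service_s


if __name__ == "__main__":
    service_s = 100e-6   # hypothetical time per command on the device
    host_gap_s = 20e-6   # hypothetical host turnaround between commands
    for qd in (1, 2, 4):
        util = device_utilization(qd, service_s, host_gap_s)
        print(f"QD={qd}: device busy {util:.0%} of the time")
```

With these made-up numbers the device sits idle for part of every cycle at queue depth 1, and a second queued command is enough to hide the host's turnaround. If the host gap were longer than the service time, it would take more than two commands in flight.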

Then there's the matter of how much data is in the queue, rather than how many commands are queued. Imagine a 4 TB SSD using 512Gbit TLC dies, and an 8-channel controller. That's 64 dies with 2 or 4 planes per die. A single page is 16kB for current NAND, so we need 2 or 4 MB of data to write if we want to light up the whole drive at once, and that much again waiting in the queue to ensure the drive can begin the next write as soon as the first batch completes. But you can often hit a bottleneck elsewhere (either the PCIe link, or the channels between the controller and NAND) before you have every plane of every die 100% busy.
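The arithmetic in that example works out as follows. This sketch assumes binary units throughout (4 TB as 4 TiB, 512Gbit as 512 Gibit, 16kB as 16 KiB), which is what makes the die count come out to exactly 64; the function names are just illustrative.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def die_count(capacity_bytes, die_bits):
    """Number of NAND dies needed for the given capacity."""
    return capacity_bytes // (die_bits // 8)


def bytes_to_fill_all_planes(dies, planes_per_die, page_bytes):
    """Data needed to program one page on every plane of every die at once."""
    return dies * planes_per_die * page_bytes


if __name__ == "__main__":
    dies = die_count(4 * TIB, 512 * GIB)  # 512 Gibit per die = 64 GiB
    print("dies:", dies)
    for planes in (2, 4):
        batch = bytes_to_fill_all_planes(dies, planes, 16 * KIB)
        # One batch being written plus one waiting in the queue.
        print(f"{planes} planes: {batch // MIB} MiB per batch, "
              f"{2 * batch // MIB} MiB to keep the drive fed")
```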

If you're working with small files, then your filesystem will be producing several small IOs for each chunk of file contents you read or write from the application layer, and many of those small metadata/fs IOs will be in the critical path, blocking your data IOs. So even though you can absolutely hit speeds in excess of 3 GB/s by issuing 2MB write commands one at a time to a suitably high-end SSD, you may have more difficulty hitting 3 GB/s by writing 2MB files one at a time.

It varies quite a bit. There are two different types of caches: SLC and DRAM. Most drives use SLC caching; higher-end drives often use both.

SSDs with DRAM typically have about 1GB of DRAM per TB of flash.

SLC caching uses a portion of the flash in SLC mode, storing 1 bit per cell rather than the usual 2-4 (2 for MLC, 3 for TLC, 4 for QLC) in exchange for higher performance. SLC cache size varies wildly. Some SSDs allocate a fixed-size cache; others allocate it dynamically based on how much free space is available. It can potentially be tens of GB on larger SSDs.
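To get a feel for the dynamic case, here is a rough sketch. The only hard fact in it is that cells run at 1 bit per cell hold 1/N of what they hold at N bits per cell. The `share_of_free_space` parameter is purely hypothetical; real firmware policies are not public and differ between drives.

```python
GB = 10 ** 9


def slc_mode_capacity(native_bytes, native_bits_per_cell):
    """Bytes the same cells can hold when run at 1 bit per cell."""
    return native_bytes // native_bits_per_cell


def dynamic_slc_cache_bytes(free_native_bytes, native_bits_per_cell,
                            share_of_free_space):
    """SLC cache size if a drive lends some share of its free space.

    share_of_free_space is an illustrative knob, not a documented setting.
    """
    lent = int(free_native_bytes * share_of_free_space)
    return slc_mode_capacity(lent, native_bits_per_cell)


if __name__ == "__main__":
    # Hypothetical TLC drive with 600 GB free, lending a quarter of it.
    cache = dynamic_slc_cache_bytes(600 * GB, 3, 0.25)
    print(f"SLC cache: {cache // GB} GB")
```

This also shows why a dynamic cache shrinks as the drive fills up: less free space means fewer cells that can be borrowed for SLC mode.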

The 1 GB of DRAM per 1 TB of flash is there to store the Flash Translation Layer's mapping from the host's logical addresses to physical addresses in flash. The write cache is separate and much more limited in size.

On SSDs? 32MB is way off: the Samsung 470 had a 256MB RAM cache, and the top 860 Pro model has a whopping 4GB.

They have started removing it entirely on some NVMe SSDs, though. I guess the direct transfer speed is enough to not need a cache at all.

The DRAM you're referring to is for the most part not a write cache for user data. Most of that DRAM is a read cache for the FTL's logical to physical address mapping table. When the FTL is working with the typical granularity of 4kB, you get a requirement of approximately 1GB of DRAM per 1TB of NAND.
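The 1GB-per-1TB figure falls out of simple arithmetic. The sketch below assumes each mapping entry is a 4-byte physical address (an assumption on my part, though it is the size that makes the ratio come out), and uses binary units.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def ftl_map_bytes(nand_bytes, mapping_unit_bytes=4 * KIB, entry_bytes=4):
    """Size of a flat logical-to-physical table.

    entry_bytes=4 is an assumed entry size: one 32-bit physical address
    per mapped unit.
    """
    entries = nand_bytes // mapping_unit_bytes
    return entries * entry_bytes


if __name__ == "__main__":
    for tib in (1, 2, 4):
        table = ftl_map_bytes(tib * TIB)
        print(f"{tib} TiB of NAND -> {table / GIB:.0f} GiB mapping table")
```

A coarser mapping unit shrinks the table proportionally (for example, `ftl_map_bytes(TIB, mapping_unit_bytes=32 * KIB)` is one eighth the size), at the cost of less efficient small writes.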

Drives that include less than this amount of DRAM show reduced performance, usually in the form of lower random read performance because the physical address of the requested data cannot be quickly found by consulting a table in DRAM and must be located by first performing at least one slow NAND read.

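NVMe drives can access system memory over the PCIe bus (the NVMe Host Memory Buffer feature), so a DRAM-less drive can keep part of its mapping table in host RAM instead.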