by wtallis 1830 days ago
Getting full throughput from the SSD is less about file size and more about how much work is in the SSD's queue at any given moment. If the host system only issues commands one at a time (as would often result from using synchronous IO APIs), then the SSD will experience some idle time between finishing one command and receiving the next from the host system. If the host ensures there are 2+ commands in the SSD's queue, it won't have that idle time.
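To make the queue-depth point concrete, here is a rough Python sketch, not a benchmark. It keeps several reads outstanding by giving each worker thread its own blocking pread(). The function name, the 128 kB chunk size, and the use of threads are assumptions for illustration only; a real application would more likely use an async API such as io_uring or libaio, and would need O_DIRECT (or a cold page cache) for the reads to actually reach the SSD. os.pread is only available on Unix-like systems.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # bytes per read command (illustrative size, an assumption)


def read_with_queue_depth(path, queue_depth, chunk=CHUNK):
    """Read a whole file while keeping up to `queue_depth` reads in flight.

    Hypothetical helper: each worker thread blocks in its own pread(),
    so the OS can have up to `queue_depth` commands outstanding at once.
    With queue_depth=1 this degenerates to one-at-a-time synchronous IO.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offsets = range(0, size, chunk)
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # map() submits every offset up front; the pool size caps
            # how many reads are actually in progress at any moment.
            parts = pool.map(lambda off: os.pread(fd, chunk, off), offsets)
            return b"".join(parts)
    finally:
        os.close(fd)
```

Calling read_with_queue_depth(path, 1) behaves like the synchronous case described above, while read_with_queue_depth(path, 4) gives the drive a chance to start the next command without waiting on the host. Any speedup you see depends on the drive, the filesystem, and whether the data is already cached.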

Then there's the matter of how much data is in the queue, rather than how many commands are queued. Imagine a 4 TB SSD using 512 Gbit (64 GB) TLC dies, and an 8-channel controller. That's 64 dies with 2 or 4 planes per die. A single page is 16 kB for current NAND, so we need 2 or 4 MB of data to write (64 dies x 2 or 4 planes x 16 kB) if we want to light up the whole drive at once, and that much again waiting in the queue to ensure the drive can begin the next write as soon as the first batch completes. But you can often hit a bottleneck elsewhere (either the PCIe link, or the channels between the controller and NAND) before you have every plane of every die 100% busy.
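The arithmetic in that example can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up, and it assumes binary prefixes (TiB, Gibit, KiB) so the numbers divide evenly, which is not exactly how drive capacities are marketed.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def bytes_to_fill_drive(capacity_bytes, die_bits, planes_per_die, page_bytes):
    """Return (die count, bytes needed to write one page to every plane).

    Hypothetical helper for the example above; ignores spare area,
    overprovisioning, and SLC caching.
    """
    die_bytes = die_bits // 8
    dies = capacity_bytes // die_bytes
    return dies, dies * planes_per_die * page_bytes


for planes in (2, 4):
    # 4 TB drive, 512 Gbit dies, 16 kB pages (binary prefixes assumed)
    dies, batch = bytes_to_fill_drive(4 * TIB, 512 * GIB, planes, 16 * KIB)
    print(f"{planes} planes/die: {dies} dies, "
          f"{batch // MIB} MiB per batch, "
          f"{2 * batch // MIB} MiB including the batch waiting in the queue")
```

So the 2 or 4 MB figure is one page per plane across all 64 dies, and doubling it covers the second batch that should already be queued when the first one finishes.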

If you're working with small files, then your filesystem will be producing several small IOs for each chunk of file contents you read or write from the application layer, and many of those small metadata/fs IOs will be in the critical path, blocking your data IOs. So even though you can absolutely hit speeds in excess of 3 GB/s by issuing 2MB write commands one at a time to a suitably high-end SSD, you may have more difficulty hitting 3 GB/s by writing 2MB files one at a time.