| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by russell_sears 3619 days ago

There are two effects at play. If you're reading a multiple of a flash page at a time, then you can get random bandwidth to match sequential. Under the hood, decent SSDs actually round robin sequential reads across all of the underlying hardware, so they can run as fast as whatever their bottleneck resource is. The bottleneck could be error correction, the SATA link, or the channels (copper traces) between the NAND and the controller, for example. Random reads with enough parallelism give the SSD enough scheduling freedom to have the same throughput as sequential.

The second effect comes from the fact that application data is rarely exactly the same size as a flash page. If you group all the related data on one page, then you don't waste time transferring bits you won't look at. B-Trees sort of do this, but they leave holes in pages so they can support efficient inserts (a good SSD or database will compress the holes, but that doesn't help random reads much, since it's still going to transfer entire flash pages to the SSD controller).

LSM-Trees push things a bit further, since new data ends up being grouped on higher levels of the tree. When I was benchmarking bLSM's predecessor, I worked out the expected throughput on the back of an envelope while I was waiting for results. It did much better than expected, since the benchmark heavily favored recently created data! YMMV, of course.