Hacker News new | ask | show | jobs
by Sirupsen 1307 days ago
Yes, sequential I/O bandwidth is closing the gap to memory. [1] The I/O pattern to watch out for, and the biggest reason why e.g. databases do careful caching to memory, is that _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory. Not to mention, in typical Cloud workloads, IOPS are far more expensive than memory.

[1]: https://github.com/sirupsen/napkin-math

6 comments

> _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory.

The wondrous thing about modern CPU architectures (e.g. Zen3), though, is all the PCIe lanes you get with them. If you really need high random IOPS, you can now cram 24 four-lane NVMe disks into a commodity server (with PCIe M.2 splitter cards) and saturate the link bandwidth on all of them. Throw them all in a RAID0, and stick a filesystem on them with the appropriate stripe width, and you'll get something that's only about 3x higher-latency for cold(!) random reads, than a read from RAM.

(My company provides a data-analytics SaaS product; this is what our pool of [shared multitenant, high concurrency] DB read-replicas look like.)

I thought NVMe flash latency was measured in tens of microseconds. 3x RAM would be a fraction of a microsecond, right?
Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good; and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access with no cross-workload memory locality to enable any useful cache coherence. And also about memory accesses only being “in play” when a worker thread isn’t context-switched out; while PCIe-triggered NVMe DMA can proceed while the thread has yielded for some other reason.

In other words, when measured E2E in the context of a larger work-step (one large enough to be interrupted by a context-switch), the mean, amortized difference between the two types of fetch becomes <3x.

Top of my wishlist for future architectures is “more, lower-width memory channels” — i.e. increased intra-CPU NUMAification. Maybe something CXL.mem will roughly simulate — kind of a move from circuit-switched memory to packet-switched memory, as it were.

How do you figure these things out, do you have special software to look at this?
I think the person you're replying to is confusing IOPS with latency. If you add enough parallelism, then NAND flash random-read IOPS will eventually reach DRAM performance.

But it's not going to be easy - for a sense of scale I just tested a 7950x at stock speeds with stock JEDEC DDR5 timings. I inserted a bunch of numbers in an 8GB block of memory, and with a deterministic random seed randomly pick 4kb pages, computing their sum and eventually reporting that (to avoid overly clever dead-code analysis, and make sure the data is fully read).

With an SSD-friendly 4K page size that resulted in 2.8 million iops of QD1 random read. By comparison, a web search for intel's 5800x optane's QD1 results shows 0.11 million iops, and that's the fastest random read SSD there is at those queue depths, AFAIK.

If you add parallelism, then ddr5 reaches 11.6 million iops at QD16 (via 16 threads), fast SSDs reach around 1 million, the optane reaches 1.5 million. An Epyc Genoa server chip has 6 times as many DDR5 memory channels as this client system does; and I'm not sure how well that scales, but 60 million 4kb random read iops sounds reasonable, I assume. Intel's memory controllers are supposedly even better (at least for clients). Turning on XMP and PBO improves results by 15-20%; and even tighter secondary/tertiary timings are likely possible.

I don't think you're going to reach those numbers not even with 24 fast NVMe drives.

And then there's the fact that I picked the ssd-friendly 4kb size; 64-byte random reads reach 260 million iops - that's not quite as much bandwidth as @ 4kb, but the scaling is pretty decent. Good luck reaching those kind of numbers on SSDs, let alone the kind of numbers a 12-channel server might reach...

We're getting close enough that the loss in performance at highly parallel workloads is perhaps acceptable enough for some applications. But it's still going to be a serious engineering challenge to even get there, and you're only going to come close under ideal (for the NAND) circumstances - lower parallelism or smaller pages and it's pretty much hopeless to arrive at even the same order of magnitude.

I measured ~1.2M IOPS (random reads, 4kiB) from 3xNVMe in a software RAID configuration on a commodity server running Ubuntu Linux in 2021. Using Samsung SSDs, not Optane.

If that scaled, it would be 9.6M IOPS from 24xNVMe.

Which is quite respectable, but still nevertheless a far cry from the presumable 60M+ iops the server would have using dram (if it scales linearly, which I doubt, it would hit 70M). Also, DRAM gets quite close to those numbers with only around 2 times as many threads as dram channels, but that NVMe setup will likely need parallelism of at least 100 to reach that - maybe much more.

Still, a mere factor 7 isn't a _huge_ difference. Plenty of use cases for that, especially since NAND has other advantages like cost/GB, capacity, and persistence.

But it's also not like this is going to replace dram very quickly. Iops is one thing, but latency is another, and there dram is still much faster; like close to 1000 times faster.

at this point cost would become the bottleneck. compare 24x1TB NVMe drives to 24TB of DDR5
That's an entirely different dimension - you can reach these throughput numbers on DDR5 likely even with merely 16GB. And an massive 12-channel 6TB socket solution will likely have slightly less than 6 times the random-read bandwidth. Capacity and bandwidth aren't strongly related here.
I’m running a FreeNAS box on an i3-8100. Right now I’m converting the NAS and my desktop to server chassis and putting them in a rack. Once I get a 10GB Unifi switch and NICs off ebay, I’m debating on running my desktop and servers diskless using iSCSI backed up by RAID0 NVME drives.
Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit thoughput that doesn't mean what you will get it always, eg 100 IO operations would generate 100 packets and it doesn't matter if it was 1500B each or only 100B.

And you wouldn't see the speed improvement on RAID0 NVMe drives except extremely rare fully sequential operations lasting for at least tens of seconds.

You also can try it just by running a VM with iSCSI boot on your current desktop.

Been a long time since anything iscsi related didn't hand 9k, for boot or otherwise.

But I look at it this way. You need 40gbit networking for a single pci3 nvme ( and newer drives can saturate that, or close )

And because you're throttling throughput you'll see much more frequent, longer, queuing delays, on the back of a network stack that ( unless you're using rdma ) is already 5x-10x slower than nvme.

It'll be fast enough for lots of things, especially home/lab use, and it'll be amazing if you're upgrading from sata spinning disk.. but 10gbit is slow by modern storage standards.

Of course, that's not the only consideration. Shared storage and iscsi in particular can be extremely convenient! And sometimes offers storage functionality that clients don't have ( snapshots, compression, replication )

> Been a long time since anything iscsi related didn't hand 9k, for boot or otherwise.

Don't have anything on the hands to look if the boot firmware even allows to set 9k, but I didn't touch iSCSI boot for a long time, so I would take your word for it.

> But I look at it this way. You need 40gbit networking ... is already 5x-10x slower than nvme.

This one.

> It'll be fast enough for lots of things, especially home/lab use

Yep, in OP's case I would consider just leaving the OS on the local [fast enough] drive and using iSCSI (if for some reason NFS/SMB doesn't fit) for any additional storage. It would be fast enough for almost everything, while completely eliminating any iSCSI boot shenanigans /me shudders in Broadcom flashbacks.

Another neat thing about iSCSI is what you can re/connect it to any device on the network in a couple of minutes (first time, even faster later), sometimes it comes really handy.

> Whatever floats your boat, but iSCSI is limited to 1500 MTU (9k? Are you sure you can boot with 9k enabled?) and while you can have 10Gbit throughput that doesn't mean what you will get it always, eg 100 IO operations would generate 100 packets and it doesn't matter if it was 1500B each or only 100B.

Ugh, ISCSI does have queueing so you can have many operations in flight, and one operation doesn't really translate to one packet in the first place, kernel will happily pack few smaller operations to TCP socket into one packet when there is load.

The single queue is the problem here but dumb admin trick is just to up more than one IP on the server and connect all of them via multipath

> kernel will happily pack few smaller operations to TCP socket into one packet when there is load.

And here comes the latency! shining.jpg

It wouldn't be a problem for a desktop use of course[0], especially considering what 90% of operations are just read requests.

My example is crude and was more to highlight what iSCSI, by virtue of running over Ethernet, inherently has a limit of how many concurrent operations can go in one moment. It's not a problem for a HDD packed SAN (HDDs would impose an upper limit, because spinning rust is spinning) but for a NVMe (especially with a single target) it could diminish the benefits of such fast storage.

> The single queue is the problem here but dumb admin trick is just to up more than one IP on the server and connect all of them via multipath

Even on a single physical link? Could work if the load is queue bound...

[0] hell, even on 1Gb link you could run multiple VMs just fine, it's just when you start to move hundreds of GBs...

>> kernel will happily pack few smaller operations to TCP socket into one packet when there is load.

>And here comes the latency! shining.jpg

Not really, if you get data faster than you can send packets (link full) there wouldn't be that much extra latency from that (at most one packet length which at 10Gbit speeds is very short) and it would be more than offset by the savings

Then again I'd guess that's mostly academic as I'd imagine not very many ISCSI operations are small enough to matter. Most apps read more than a byte at a time after all, hell, you literally can't read less than a block from a block device which is at least 512 bytes.

>> The single queue is the problem here but dumb admin trick is just to up more than one IP on the server and connect all of them via multipath

> Even on a single physical link? Could work if the load is queue bound...

You can also use it to use multiple NICs without bonding/teaming, althought it is easier to have them in separate network, IIRC linux had some funny business when if you didn't configure it correctly for traffic in same network it would pick "first available" NIC to send it and it needed /proc setting to change

To elaborate, default setting for /proc/sys/net/ipv4/conf/interface/arp_ignore (and arp_announce) is 0 which means

> 0 - (default): reply for any local target IP address, configured on any interface

> 0 - (default) Use any local address, configured on any interface 1

IIRC to do what I said required

    net.ipv4.conf.all.arp_ignore=1
    net.ipv4.conf.all.arp_announce=2
which basically changed that to "only send/respond to ARPs from NICs where actual address exists, not just ones with the address in same network" and fixed the problem.
I’m out of SATA ports and I have 2 M.2 slots available. When I can test with VM in my current desktop I will.
That's a lot of effort to put silent piece of silicon few metres away from the machine.

iSCSI gotta eat some of your CPU (you're changing "send a request to disk controller and wait" to "do a bunch of work to create packet,send it over the network, and get it back) if you don't have card with offload, it also might kinda not be fast enough to get the most out of NVMe, especially more in RAID0

And, uh, just don't keep anything important there...

It’s an i3 with 2 M.2 slots available. Enough for the home. SATA becomes the limit.
As a dev who operates fairly far away from hardware usually, is that similar to what the PS5 is doing?
Depending on how your madvise is set up, it's often the case that sequential disk reads are memory reads. You're typically only paying the price for touching the first page in a sequential run, that or subsequent page faults come at a big discount.

If you read 1,000,000 random bytes (~1 Mb) scattered across a huge file (let's say you're fetching from some humongous on-disk hash table), it will to a first order be about as slow as reading 4 Gb sequentially. This will incur the same number of page faults. There are ways of speeding this up, but only so much.

Although, I/O is like an onion of caching layers, so in practice this may or may not hold up depending on previous access patterns of the file, lunar cycles, whether venus is in retrograde.

`madvise(2)` doesn't matter _that_ much in my experience with [1] on modern Linux kernels. SSD just can't read _quite_ as quickly as memory in my testing. Sure, SSD will be able to re-read a lot into ram, analogous to how memory reading will be able to rapidly prefetch into L1.

I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

[1]: https://github.com/sirupsen/napkin-math

I think the effects of madvise primarily crop up in extremely I/O-saturated scenarios, which are rare. Reads primarily incur latency, with a good SSD it's hard to actually run into IOPS limitations and you're not likely to run out of RAM for caching either in this scenario. MADV_RANDOM is usually a pessimization, MADV_SEQUENTIAL may help if you are truly reading sequentially, but may also worsen performance as pages don't linger as long.

But as I mentioned, there's caching upon caching, and also protocol level optimizations, and hardware-level considerations (physical block size may be quite large but is generally unknown).

It's nearly impossible to benchmark this stuff in a meaningful way. Or rather, it's nearly impossible to know what you are benchmarking, as there are a lot of nontrivially stateful parts all the way down that have real impact on your performance.

There are so many moving parts I think the only meaningful disk benchmarks consider whatever application you want to make go faster. Do the change. Is it faster? Great. Is it not? Well at least you learned.

> I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

Assuming that you run the experiments on NVMe SSD which is attached to PCIe 3.0, where theoretical maximum is around 1GB/s per each lane, I am not sure I understand how do you expect to go faster than 4 GiB/s? Isn't that already a theoretical maximum of what you can achieve?

PCIe 4.0 SSDs are pretty common nowadays and are basically limited to what PCIe 4.0 x4 can do (around 7 GB/s net throughput).
I don't think they're that common. You would have to have quite recentish motherboard and CPU that both support PCIe 4.0.

And I'm pretty sure that parent comment doesn't own such a machine because otherwise I'd expect 7-8GB/s figure to be reported in the first place.

I really doubt they’re that common. They only became available on motherboards fairly recently, and are quite expensive.

I’d guess that they’re a small minority of devices at the moment.

PCIe 5.0 has just recently started showing up on consumer motherboards.

4.0 might not be common, but surprisingly it is now the previous generation!

You might be very right about that! It's been a while since I did the SSD benchmarks. Glad to hear it's most likely entirely accurate at 4 GiB/s then!
How'd you measure the maximum memory bandwidth? In Algorithmica's benchmark, the max bandwidth was observed to be about 42 GBPS: https://en.algorithmica.org/hpc/cpu-cache/sharing/

I'm not sure how they calculated the theoretical limit of 42.4 GBPS, but they have multiple measurements higher than 30 GBPS.

> Yes, sequential I/O bandwidth is closing the gap to memory.

Hilariously meanwhile, RAM has become significantly slower compared to CPU performance, i.e. you spend a disproportionate time to read and write to memory, so despite RAM is faster, CPU is way faster.

Which means I/O remains a bottleneck...

Random I/O with NVME is slower than sequential I/O still, but the gap between the two has been narrowed considerably and is quite high in historical/comparative absolute terms. To get close to peak random I/O limits, you need to dispatch a lot of commands in parallel—that’s an I/O idiom that doesn’t really exist in high level languages, and I think that’s where a lot of the challenge is.
The problem is that a lot of workloads using random I/O have dependencies between the different I/O operations. A database traversing a tree-like index cannot issue the next read until the previous one has finished. You are limited by latency, which for NVMe is still orders of magnitude worse than memory.
> A database traversing a tree-like index cannot issue the next read until the previous one has finished.

This applies to a single point query in a single tree.

The latency is reduced by overlap in obvious ways as soon as you have (1) a range query because it can read multiple subtrees in parallel, or (2) a query that reads multiple indexes in parallel, or (3) multiple queries from the application to the database in parallel.

This is why it's useful to design applications to make multiple queries in parallel. Web applications are a great example of this. Most applications where I/O performance matters at all have some natural way to parallelise queries.

Less obviously, the interior blocks of a B-tree are a relatively small part of a B-tree. I.e. most of the space is in used leaf blocks. If the database's cache strategy gives preference to interior nodes, and even more preference to nodes closer to the root of a tree, often several interior layers of the tree can fit entirely in RAM and the effect is is to reduce the latency of tree lookups further once the cache is warmed up.

Then even in large databases (a few TB), the latency of a single point query is reduced to a one or two read IOPS (because the leaf page to read which contains the query result is calculated from in-memory data). The application-visible query time is very similar to the I/O subsystem's timing characteristics, and a few MQPS are achievable (= "million queries per second"). Not many database engines achieve this, because they were designed in an area where I/O was much slower, but the I/O architecture does support it.

Source: Wrote a performance-optimised database engine for blockchain archive data, which is extremely random access (because of hashing), in the multiple terabytes range, and the application is bottlenecked on how many queries per second it can achieve. It's like the ideal case for working on random-access I/O performance :-)

Those are rarely the slow ones though. Lots of software simply has not been written to keep IO queues full. They read a bit of data, process it, read the next bit and so on. On a single thread. This makes all kinds of IO (including network) way slower than it has to be.

For example a tree-index can be parallelized by walking down different branches. On top of that one can issue a prefetch for the next node (on each branch) while processing the current ones.

Yup, a lot of software is (was?) written with assumptions that mattered with spinning rust. And even if author didn't intend to, serial code generates serial, dependent IO.
I thought for SSDs it didn't matter whether data was adjacent on disk?
Well, yes and no.

With spinning rust you have to wait for the sector you want to read to rotate underneath the read head. For a fast 10.000 RPM drive, a single rotation takes 6 milliseconds. This means that for random access the average latency is going to be 3 milliseconds - and even that's ignoring the need to move the read head between different tracks! Sequential data doesn't suffer from this, because it'll be passing underneath the read head in the exact order you want - you can even take the track switching time into account to make this even better.

SSDs have a different problem. Due to the way NAND is physically constructed it is only possible to read a single page at a time, and accessing a single page has a latency of a few nanoseconds. This immediately places a lower limit on the random read access time. However, SSDs allow you to send read commands which span many pages, allowing the SSD to reorder the reads in the most optimal way, and do multiple reads in parallel. This means that you only have to pay the random access penalty once - not to mention that you have to issue way fewer commands to the SSD.

SSDs try to make this somewhat better by having a very deep command queue: you can issue literally thousands of random reads at once, and the SSD will reorder them for faster execution. Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure, and you are still wasting a lot of effort reading entire pages when you only need a few bytes.

Interesting, thanks! So it sounds like it's not so much "random" I/O that's slow, but rather "unbatched" I/O or something like that?

Curious to hear your thoughts on this thread if you have time to share: https://news.ycombinator.com/item?id=33752870

> Unfortunately this doesn't gain you a lot if your random reads have dependencies, such as when traversing a tree structure,

So, this mean Btrees suffer? Which could be the most optimal layout for a database storage where only SSD matters?

I'm working in one that is just WAL-only and scanning all in each operation (for now!) and wanna see what I can do for improve the situation.

You really need an NVME interface to the SSD, though. SATA3 is the bottleneck for SATA SSDs
It is not just about IO itself but all the Processing which Database (or OS) needs to do for cache management, which in OLTP cases can be very significant. If you just need few bytes from that 8K/16K page you have to read there is little way around 2 orders of magnitude difference.

What we would really benefit is storage which is efficient in small (cpu cache line) size IO