
by marginalia_nu 1307 days ago

Depending on how your madvise is set up, it's often the case that sequential disk reads are effectively memory reads. You typically pay the price only for touching the first page in a sequential run; either that, or subsequent page faults come at a big discount.

If you read 1,000,000 random bytes (~1 MB) scattered across a huge file (let's say you're fetching from some humongous on-disk hash table), it will, to a first order, be about as slow as reading 4 GB sequentially, since each byte pulls in a whole 4 KB page and both incur the same number of page faults. There are ways of speeding this up, but only so much.
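
To spell out the arithmetic behind that estimate, here is a rough sketch. It assumes 4 KiB pages (typical on x86-64 Linux, but an assumption here) and that every random byte lands on its own uncached page; the function name is made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; os.sysconf("SC_PAGE_SIZE") reports the real one


def bytes_paged_in(random_reads: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes the kernel must page in if every read faults on a distinct page."""
    return random_reads * page_size


if __name__ == "__main__":
    total = bytes_paged_in(1_000_000)
    print(f"{total / 1e9:.2f} GB paged in for about 1 MB of useful data")
```

In practice readahead, the page cache, and reads that happen to share a page all move this number, so treat it as an upper-bound napkin estimate.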

Although I/O is like an onion of caching layers, so in practice this may or may not hold up depending on previous access patterns of the file, lunar cycles, and whether Venus is in retrograde.


Sirupsen 1307 days ago

`madvise(2)` doesn't matter _that_ much in my experience with [1] on modern Linux kernels. SSDs just can't read _quite_ as quickly as memory in my testing. Sure, an SSD will be able to re-read a lot into RAM, analogous to how memory reading will be able to rapidly prefetch into L1.

I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.
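
For anyone who wants a comparable single-threaded number, a minimal sketch of a sequential read measurement follows. This is not the napkin-math harness (that project is separate); the function name, chunk size, and scratch file are assumptions for illustration, and it uses plain blocking reads with no io_uring.

```python
import os
import tempfile
import time


def sequential_read_gib_per_s(path: str, chunk_size: int = 1 << 20) -> float:
    """Read the file front to back once and return throughput in GiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered: one read() call per chunk
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total / elapsed / (1 << 30)


if __name__ == "__main__":
    # Hypothetical scratch file. A freshly written file is almost certainly still
    # in the page cache, so this run measures memory rather than the SSD.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(16 << 20))  # 16 MiB
    try:
        print(f"{sequential_read_gib_per_s(tmp.name):.2f} GiB/s")
    finally:
        os.unlink(tmp.name)
```

To get anything resembling a disk number you would need a file much larger than RAM, or a dropped page cache, before timing.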

[1]: https://github.com/sirupsen/napkin-math


marginalia_nu 1307 days ago

I think the effects of madvise primarily crop up in extremely I/O-saturated scenarios, which are rare. Reads primarily incur latency; with a good SSD it's hard to actually run into IOPS limitations, and you're not likely to run out of RAM for caching in this scenario either. MADV_RANDOM is usually a pessimization; MADV_SEQUENTIAL may help if you are truly reading sequentially, but may also worsen performance as pages don't linger as long.
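
For reference, setting these hints on a memory-mapped file looks roughly like the sketch below. It assumes Python 3.8+ on a platform that exposes `mmap.madvise` and the `MADV_*` constants (Linux does), and it quietly skips the hint elsewhere. The function name is hypothetical, and the file must be non-empty for `mmap` to accept it.

```python
import mmap


def sum_bytes_with_hint(path: str, advice_name: str = "MADV_SEQUENTIAL") -> int:
    """mmap a file read-only, apply an madvise hint if available, and touch every byte."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            advice = getattr(mmap, advice_name, None)  # constant may be missing on some platforms
            if advice is not None and hasattr(mm, "madvise"):
                mm.madvise(advice)  # a hint only; the kernel is free to ignore it
            total = 0
            step = 1 << 20
            for offset in range(0, len(mm), step):
                total += sum(mm[offset:offset + step])  # slicing faults the pages in
            return total
```

Calling it once with `"MADV_SEQUENTIAL"` and once with `"MADV_RANDOM"` on the same large file, and timing each, is about the simplest way to see whether the hint changes anything on a given machine.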

But as I mentioned, there's caching upon caching, and also protocol level optimizations, and hardware-level considerations (physical block size may be quite large but is generally unknown).

It's nearly impossible to benchmark this stuff in a meaningful way. Or rather, it's nearly impossible to know what you are benchmarking, as there are a lot of nontrivially stateful parts all the way down that have real impact on your performance.

There are so many moving parts that I think the only meaningful disk benchmarks are the ones that measure whatever application you want to make go faster. Make the change. Is it faster? Great. Is it not? Well, at least you learned something.


menaerus 1307 days ago

> I get ~30 GiB/s for threaded sequential memory reads, but ~4 GiB/s for SSD. However, I think the SSD number is single-threaded and not even with io_uring—so I need to regenerate those numbers. It's possible it could be 2-4x better.

Assuming that you ran the experiments on an NVMe SSD attached to PCIe 3.0, where the theoretical maximum is around 1 GB/s per lane (so about 4 GB/s for an x4 drive), I am not sure I understand how you expect to go faster than 4 GiB/s. Isn't that already the theoretical maximum of what you can achieve?


formerly_proven 1307 days ago

PCIe 4.0 SSDs are pretty common nowadays and are basically limited to what PCIe 4.0 x4 can do (around 7 GB/s net throughput).
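
The link ceilings being thrown around here can be checked with a quick calculation. This sketch assumes an x4 link (typical for M.2 NVMe drives, but an assumption) and uses the per-lane signalling rates and 128b/130b encoding of PCIe 3.0 through 5.0. It gives the raw link limit only; packet and protocol overhead push the usable figure lower.

```python
# Per-lane signalling rate in GT/s. PCIe 3.0, 4.0 and 5.0 all use 128b/130b encoding.
GIGATRANSFERS_PER_LANE = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_link_gb_per_s(generation: str, lanes: int = 4) -> float:
    """Upper bound on link bandwidth in GB/s (10^9 bytes/s), before protocol overhead."""
    gbit_per_lane = GIGATRANSFERS_PER_LANE[generation] * ENCODING_EFFICIENCY
    return gbit_per_lane / 8 * lanes  # bits to bytes, then scale by lane count


if __name__ == "__main__":
    for gen in ("3.0", "4.0", "5.0"):
        print(f"PCIe {gen} x4: {pcie_link_gb_per_s(gen):.2f} GB/s")
```

That works out to a bit under 4 GB/s for 3.0 x4 and a bit under 8 GB/s for 4.0 x4, which lines up with both the ~4 GiB/s measurement and the ~7 GB/s net figure.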


menaerus 1307 days ago

I don't think they're that common. You would need a fairly recent motherboard and CPU that both support PCIe 4.0.

And I'm pretty sure the parent commenter doesn't own such a machine, because otherwise I'd expect a 7-8 GB/s figure to have been reported in the first place.


dagmx 1307 days ago

I really doubt they’re that common. They only became available on motherboards fairly recently, and are quite expensive.

I’d guess that they’re a small minority of devices at the moment.


robocat 1306 days ago

PCIe 5.0 has just recently started showing up on consumer motherboards.

4.0 might not be common, but surprisingly it is now the previous generation!


Sirupsen 1307 days ago

You might be very right about that! It's been a while since I did the SSD benchmarks. Glad to hear it's most likely entirely accurate at 4 GiB/s then!


shitlord 1307 days ago

How'd you measure the maximum memory bandwidth? In Algorithmica's benchmark, the max bandwidth was observed to be about 42 GB/s: https://en.algorithmica.org/hpc/cpu-cache/sharing/

I'm not sure how they calculated the theoretical limit of 42.4 GB/s, but they have multiple measurements higher than 30 GB/s.
