|
|
|
|
|
by bcrl
241 days ago
|
|
If you're doing large amounts of sequential reads from a filesystem, it's probably not in cache. You only get latency that low if you're doing nothing else that stresses the memory subsystem, which is rather unlikely. Real applications have overhead, which is why microbenchmarks like this are useless. Microbenchmarks are not the best first order estimate for programmers to think of. |
|
Memory subsystem bottlenecks are real, but even in real applications, it's common for the memory subsystem to not be the bottleneck. For example, in this case we're discussing system call overhead, which tends to move the system bottleneck inside the CPU (even though a significant part of that effect is due to L1I cache evictions).
Moreover, even if the memory subsystem is the bottleneck, on the system I was measuring, it will not push the sequential memory access time anywhere close to 1 nanosecond per byte. I just don't have enough cores to oversubscribe the memory bus 30×. (1.5×, I think.) Having such a large ratio of processor speed to RAM interconnect bandwidth is in fact very unusual, because it tends to perform very poorly in some workloads.
If microbenchmarks don't give you a pretty good first-order performance estimate, either you're doing the wrong microbenchmarks or you're completely mistaken about what your application's major bottlenecks are (plural, because in a sequential program you can have multiple "bottlenecks", colloquially, unlike in concurrent systens where you almost always havr exactly one bottleneck.) Both of these problems do happen often, but the good news is that they're fixable. But giving up on microbenchmarking will not fix them.