by derefr 1307 days ago
Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good, and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access, with no cross-workload memory locality to enable any useful cache reuse. It’s also about memory accesses only being “in play” while a worker thread isn’t context-switched out, whereas PCIe-triggered NVMe DMA can proceed while the thread has yielded for some other reason.

In other words, when measured E2E in the context of a larger work-step (one large enough to be interrupted by a context-switch), the mean, amortized difference between the two types of fetch becomes <3x.
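
To make the amortization argument concrete, here is a toy model, not a measurement. Every name and number in it is a hypothetical chosen for illustration: `mem_contention` stands in for memory slowdown under all-cores load, `overhead_ns` for per-work-step cost such as a context switch, and `hidden_ns` for the part of the NVMe latency that overlaps with time the thread was descheduled anyway.

```python
def amortized_ratio(mem_ns, nvme_ns, overhead_ns=0.0,
                    mem_contention=1.0, hidden_ns=0.0):
    """Ratio of end-to-end work-step time: NVMe fetch vs. memory fetch.

    All parameters are illustrative assumptions, not measured values.
    """
    # Memory fetch: raw latency inflated by contention, plus fixed step overhead.
    mem_step = mem_ns * mem_contention + overhead_ns
    # NVMe fetch: only the latency NOT hidden behind a descheduled thread counts.
    nvme_step = max(nvme_ns - hidden_ns, 0.0) + overhead_ns
    return nvme_step / mem_step


# Hypothetical "ideal conditions": no contention, no overhead, nothing hidden.
raw = amortized_ratio(mem_ns=100, nvme_ns=10_000)

# Hypothetical loaded system: contended memory, a context switch per work-step,
# and part of the DMA completing while the thread is switched out.
e2e = amortized_ratio(mem_ns=100, nvme_ns=10_000, overhead_ns=5_000,
                      mem_contention=10, hidden_ns=4_000)

print(f"raw ratio: {raw:.1f}x, amortized ratio: {e2e:.2f}x")
```

The point of the sketch is only the shape of the effect: a fixed per-step overhead appears in both numerator and denominator, so it pulls the ratio toward 1, while contention and overlapped DMA shrink the gap further.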

Top of my wishlist for future architectures is “more, lower-width memory channels” — i.e. increased intra-CPU NUMAification. Maybe something CXL.mem will roughly simulate — kind of a move from circuit-switched memory to packet-switched memory, as it were.

1 comment

How do you figure these things out? Do you have special software to look at this?