by derefr
1307 days ago
Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good, and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access, with no cross-workload memory locality to enable any useful cache coherence. It is also about memory accesses only being “in play” while a worker thread is scheduled, whereas PCIe-triggered NVMe DMA can proceed while the thread has yielded for some other reason. In other words, when measured E2E in the context of a larger work-step (one large enough to be interrupted by a context switch), the mean, amortized difference between the two types of fetch becomes <3x. Top of my wishlist for future architectures is “more, lower-width memory channels”, i.e. increased intra-CPU NUMAification. Maybe that is something CXL.mem will roughly simulate: kind of a move from circuit-switched memory to packet-switched memory, as it were.
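The amortization argument above can be sketched as a toy model. This is only an illustration under stated assumptions, not a measurement: the function name `step_time`, the cost model, and every number below are hypothetical, in arbitrary time units. It assumes that a DMA-style fetch keeps progressing while the thread is descheduled, whereas a load-style memory fetch only progresses while the thread is running.

```python
def step_time(compute, fetch, descheduled, overlaps_with_deschedule):
    """End-to-end time of one work-step in a toy model (hypothetical units)."""
    if overlaps_with_deschedule:
        # DMA-style fetch: it continues while the thread is off-CPU, so only
        # the part that outlasts the descheduled window is visible to the step.
        visible_fetch = max(0.0, fetch - descheduled)
    else:
        # Load-style fetch: it only makes progress while the thread runs.
        visible_fetch = fetch
    return compute + descheduled + visible_fetch


# Hypothetical inputs, chosen only to show the shape of the effect.
COMPUTE = 50.0      # work done per step besides the fetch
DESCHEDULED = 20.0  # time the thread spends context-switched out
MEM_FETCH = 10.0    # contended main-memory fetch
NVME_FETCH = 40.0   # NVMe fetch via DMA

raw_ratio = NVME_FETCH / MEM_FETCH
e2e_ratio = (
    step_time(COMPUTE, NVME_FETCH, DESCHEDULED, overlaps_with_deschedule=True)
    / step_time(COMPUTE, MEM_FETCH, DESCHEDULED, overlaps_with_deschedule=False)
)

print(f"raw fetch ratio:    {raw_ratio:.2f}x")
print(f"end-to-end ratio:   {e2e_ratio:.2f}x")
```

In this model the end-to-end ratio is always closer to 1 than the raw fetch ratio whenever the step contains compute time or a descheduled window, which is the qualitative point being made; the actual magnitudes depend entirely on the workload.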