| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by usefulcat 288 days ago
	I have heard that some Intel NICs can put received data directly into L3 cache. That would definitely make it faster to access than if it were in main RAM. If a NIC can do that over PCI, probably other PCI devices could do the same, at least in theory.

1 comments

wahern 288 days ago

For the curious,

> When a 100G NIC is fully utilized with 64B packets and 20B Ethernet overhead, a new packet arrives every 6.72 nanoseconds on average. If any component on the packet path takes longer than this time to process the individual packet, a packet loss occurs. For a core running at 3GHz, 6.72 nanoseconds only accounts for 20 clock cycles, while the DRAM latency is 5-10 times higher, on average. This is the main bottleneck of the traditional DMA approach.

> The Intel® DDIO technology in Intel® Xeon® processors eliminates this bottleneck. Intel® DDIO technology allows PCIe devices to perform read and write operations directly to and from the L3 cache, or the last level cache (LLC).

https://www.intel.com/content/www/us/en/docs/vtune-profiler/...

link

mrlongroots 288 days ago

From a response elsewhere in this thread (my current understanding, could be wrong):

> You can temporarily stream directly to the L3 cache with DDIO, but as it fills out the cache will get flushed back to the main memory anyway and you will ultimately be memory-bound. I don't think there is some way to do some non-temporal magic here that circumvents main memory entirely.

link

wahern 287 days ago

Ah. I guess it's only useful for an optimized router/firewall or network storage appliance, perhaps with a bespoke stack carefully tuned to quickly process and then hand the data back to the controller (via DDIO) before it flushes.

EDIT: Here's an interesting writeup about trying to make use of it with FreeBSD+netmap+ipfw: https://adrianchadd.blogspot.com/2015/04/intel-ddio-llc-cach... So it can work as advertised, it's just very constraining, requiring careful tuning if not outright rearchitecting your processing pipeline with the constraints in mind.

link