| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by scrubs 1410 days ago

This benchmark reminds me of "ffwd: delegation is (much) faster than you think" https://www.seltzer.com/margo/teaching/CS508-generic/papers-....

This paper describes a mechanism for client threads pinned to a distinct cores to delegate a function call to distinguished server thread pinned to its own core all on the same socket.

This has a multitude of applications the most obvious one making a shared data structure MT safe through delegation rather than saddling it with mutexes or other synchronization points especially beneficial with small critical sections.

The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."

The code does delegation without CAS, locks, or atomics.

The efficacy of such a scheme rests on two facets, which the paper explains:

* Modern CPUs can move GBs/second between core L2/LLC caches

* The synchronization between requesting clients and responding servers depends on each side spinning on shared memory address looking for bit toggles. Briefly, servers only read client request memory which the client only writes. (Clients each have their own slot). And on the response side client's read the servers shared response memory, which only the server writes. This one-side read, one-side write is supposed to minimize the number of cache invalidations and MESI syncs.

I spent some time testing the author's code and went so far as writing my own version. I was never able to make it work with anywhere near the throughput claimed in the paper. There's also some funny "nop" assembler instructions within the code that I gather is a cheap form of thread yielding.

In fact this relatively simple SPCP MT ring buffer which has but a fraction of the code:

https://rigtorp.se/ringbuffer/

did far, far better.

In my experiments then CPU spun too quickly so that core-to-core bandwidth was quickly squandered before the server could signal response or the client could signal request. I wonder if adding select atomic reads as with the SPSC ring might help.