Hacker News new | ask | show | jobs
by secondcoming 83 days ago
If you enforce that the buffer size is a power of 2 you just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;
part
3 comments

If it's a power of two, you don't need the branch at all. Let the unsigned index wrap.
You ultimately need a mask to access the correct slot in the ring. But it's true that you can leave unmasked values in your reader/writer indices.
Interesting, I've never heard about anybody using this. Maybe a bit unreadable? But yeah, should work :)
Nice one!
I believe ConcurrencyKit's impl does this.

https://github.com/concurrencykit/ck/blob/master/include/ck_...

Indeed that's true. That extra constraint enables further optimization

It's mentioned in the post, but worth reiterating!

Nice!

Should be able to push it more if

* we limit data shared to an atomic-writable size and have a sentinel - less mucking around with cached indexes - just spinning on (buffer_[rpos_]!=sentinel) (atomic style with proper sematics, etc..).

* buffer size is compile-time - then mod becomes compile-time (and if a power of 2 - just a bitmask) - and so we can just use a 64-bit uint to just count increments, not position. No branch to wrap the index to 0.

Also, I think there's a chunk of false sharing if the reader is 2 or 3 ahead of the writer - so performance will be best if reader and writer are cachline apart - but will slow down if they are sharing the same cacheline (and buffer_[12] and buffer_[13] very well may if the payload is small). Several solutions to this - disruptor patter or use a cycle from group theory - i.e. buffer[_wpos%9] for example (9 needs to be computed based on cache line size and size of payload).

I've seen these be able to pushed to about clockspeed/3 for uint64 payload writes on modern AMD chips on same CCD.

This was, in fact, mentioned in the article.