| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by secondcoming 83 days ago

If you enforce that the buffer size is a power of 2 you just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;

part

3 comments

JonChesterfield 83 days ago

If it's a power of two, you don't need the branch at all. Let the unsigned index wrap.

link

loeg 83 days ago

You ultimately need a mask to access the correct slot in the ring. But it's true that you can leave unmasked values in your reader/writer indices.

link

dalvrosa 83 days ago

Interesting, I've never heard about anybody using this. Maybe a bit unreadable? But yeah, should work :)

link

mandarax8 83 days ago

See https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff... which takes it even further :)

Nice one!

I believe ConcurrencyKit's impl does this.

https://github.com/concurrencykit/ck/blob/master/include/ck_...

link

dalvrosa 83 days ago

Indeed that's true. That extra constraint enables further optimization

It's mentioned in the post, but worth reiterating!

link

foobar10000 82 days ago

Nice!

Should be able to push it more if

* we limit data shared to an atomic-writable size and have a sentinel - less mucking around with cached indexes - just spinning on (buffer_[rpos_]!=sentinel) (atomic style with proper sematics, etc..).

* buffer size is compile-time - then mod becomes compile-time (and if a power of 2 - just a bitmask) - and so we can just use a 64-bit uint to just count increments, not position. No branch to wrap the index to 0.

Also, I think there's a chunk of false sharing if the reader is 2 or 3 ahead of the writer - so performance will be best if reader and writer are cachline apart - but will slow down if they are sharing the same cacheline (and buffer_[12] and buffer_[13] very well may if the payload is small). Several solutions to this - disruptor patter or use a cycle from group theory - i.e. buffer[_wpos%9] for example (9 needs to be computed based on cache line size and size of payload).

I've seen these be able to pushed to about clockspeed/3 for uint64 payload writes on modern AMD chips on same CCD.

link

loeg 83 days ago

This was, in fact, mentioned in the article.

link