| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by JonChesterfield 89 days ago

It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Also doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird c++.

Maybe an llm that likes a different language more, copying a broken implementation off github? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.

2 comments

loeg 89 days ago

> It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Are we reading the same code? The stores are clearly after value accesses.

> Also doesn't have fences on the store

?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.

link

JonChesterfield 89 days ago

Push:

buffer_[head] = value;

head_.store(next_head, std::memory_order_release);

return true;

There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.

link

loeg 89 days ago

> There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.

No, this is incorrect. If you think there's no relationship, you don't understand "release" semantics.

https://en.cppreference.com/w/cpp/atomic/memory_order.html

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

link

JonChesterfield 82 days ago

This was _really_ surprising to me. What's the point of marking individual stores if it affects everything, not just that address. But yeah, what I can find online agrees that C++ has done this. Thanks!

link

judofyr 89 days ago

This is just wrong. See https://en.cppreference.com/w/cpp/atomic/memory_order.html. Emphasis mine:

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).

link

blacklion 89 days ago

write with release semantic cannot be reordered with any other writes, dependent or not.

Relaxed atomic writes can be reordered in any way.

link

loeg 89 days ago

> write with release semantic cannot be reordered with any other writes, dependent or not.

To quibble a little bit: later program-order writes CAN be reordered before release writes. But earlier program-order writes may not be reordered after release writes.

> Relaxed atomic writes can be reordered in any way.

To quibble a little bit: they can't be reordered with other operations on the same variable.

link

blacklion 89 days ago

Yep, you are right, more precise, and precision is very important in this topic.

I stand corrected.

link

dalvrosa 89 days ago

Sorry, but that's not actually true. There are no data races, the atomics prevent that (note that there are only one consumer and one producer)

Regarding the style, it follows the "almost always auto" idea from Herb Sutter

link

secondcoming 89 days ago

If you enforce that the buffer size is a power of 2 you just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;

part

link

JonChesterfield 89 days ago

If it's a power of two, you don't need the branch at all. Let the unsigned index wrap.

link

loeg 89 days ago

You ultimately need a mask to access the correct slot in the ring. But it's true that you can leave unmasked values in your reader/writer indices.

link

dalvrosa 89 days ago

Interesting, I've never heard about anybody using this. Maybe a bit unreadable? But yeah, should work :)

link

mandarax8 89 days ago

See https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff... which takes it even further :)

Nice one!

I believe ConcurrencyKit's impl does this.

https://github.com/concurrencykit/ck/blob/master/include/ck_...

link

dalvrosa 89 days ago

Indeed that's true. That extra constraint enables further optimization

It's mentioned in the post, but worth reiterating!

link

foobar10000 87 days ago

Nice!

Should be able to push it more if

* we limit data shared to an atomic-writable size and have a sentinel - less mucking around with cached indexes - just spinning on (buffer_[rpos_]!=sentinel) (atomic style with proper sematics, etc..).

* buffer size is compile-time - then mod becomes compile-time (and if a power of 2 - just a bitmask) - and so we can just use a 64-bit uint to just count increments, not position. No branch to wrap the index to 0.

Also, I think there's a chunk of false sharing if the reader is 2 or 3 ahead of the writer - so performance will be best if reader and writer are cachline apart - but will slow down if they are sharing the same cacheline (and buffer_[12] and buffer_[13] very well may if the payload is small). Several solutions to this - disruptor patter or use a cycle from group theory - i.e. buffer[_wpos%9] for example (9 needs to be computed based on cache line size and size of payload).

I've seen these be able to pushed to about clockspeed/3 for uint64 payload writes on modern AMD chips on same CCD.

link

loeg 89 days ago

This was, in fact, mentioned in the article.

link