Hacker News new | ask | show | jobs
by JonChesterfield 89 days ago
It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Also doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird c++.

Maybe an llm that likes a different language more, copying a broken implementation off github? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.

2 comments

> It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Are we reading the same code? The stores are clearly after value accesses.

> Also doesn't have fences on the store

?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.

Push:

buffer_[head] = value;

head_.store(next_head, std::memory_order_release);

return true;

There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.

> There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.

No, this is incorrect. If you think there's no relationship, you don't understand "release" semantics.

https://en.cppreference.com/w/cpp/atomic/memory_order.html

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

This was _really_ surprising to me. What's the point of marking individual stores if it affects everything, not just that address. But yeah, what I can find online agrees that C++ has done this. Thanks!
This is just wrong. See https://en.cppreference.com/w/cpp/atomic/memory_order.html. Emphasis mine:

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).

write with release semantic cannot be reordered with any other writes, dependent or not.

Relaxed atomic writes can be reordered in any way.

> write with release semantic cannot be reordered with any other writes, dependent or not.

To quibble a little bit: later program-order writes CAN be reordered before release writes. But earlier program-order writes may not be reordered after release writes.

> Relaxed atomic writes can be reordered in any way.

To quibble a little bit: they can't be reordered with other operations on the same variable.

Yep, you are right, more precise, and precision is very important in this topic.

I stand corrected.

Sorry, but that's not actually true. There are no data races, the atomics prevent that (note that there are only one consumer and one producer)

Regarding the style, it follows the "almost always auto" idea from Herb Sutter

If you enforce that the buffer size is a power of 2 you just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;
part
If it's a power of two, you don't need the branch at all. Let the unsigned index wrap.
You ultimately need a mask to access the correct slot in the ring. But it's true that you can leave unmasked values in your reader/writer indices.
Interesting, I've never heard about anybody using this. Maybe a bit unreadable? But yeah, should work :)
Nice one!
I believe ConcurrencyKit's impl does this.

https://github.com/concurrencykit/ck/blob/master/include/ck_...

Indeed that's true. That extra constraint enables further optimization

It's mentioned in the post, but worth reiterating!

Nice!

Should be able to push it more if

* we limit data shared to an atomic-writable size and have a sentinel - less mucking around with cached indexes - just spinning on (buffer_[rpos_]!=sentinel) (atomic style with proper sematics, etc..).

* buffer size is compile-time - then mod becomes compile-time (and if a power of 2 - just a bitmask) - and so we can just use a 64-bit uint to just count increments, not position. No branch to wrap the index to 0.

Also, I think there's a chunk of false sharing if the reader is 2 or 3 ahead of the writer - so performance will be best if reader and writer are cachline apart - but will slow down if they are sharing the same cacheline (and buffer_[12] and buffer_[13] very well may if the payload is small). Several solutions to this - disruptor patter or use a cycle from group theory - i.e. buffer[_wpos%9] for example (9 needs to be computed based on cache line size and size of payload).

I've seen these be able to pushed to about clockspeed/3 for uint64 payload writes on modern AMD chips on same CCD.

This was, in fact, mentioned in the article.