Hacker News new | ask | show | jobs
by el_pollo_diablo 885 days ago
There are many problems with this concurrent queue. Here are a few I found in a cursory reading.

In order to enqueue an element, you have to change two atomic pointers: tail->next and tail. That cannot be done atomically.

enqueue() assumes that the queue is not empty.

enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail).

Then there is the old-time favourite ABA problem.

By the way, when used in a loop, compare_exchange_weak is preferred to compare_exchange_strong, and there is no need to reload the atomic every time: compare_exchange_* do that for you by updating the argument containing the expected value.

4 comments

> In order to enqueue an element, you have to change two atomic pointers: tail->next and tail. That cannot be done atomically.

With added indirection, there are ways of doing this atomically.

I'm surprised more people are not calling out

"enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail)."

Its probably one of the bigger problems of this queue, basically negates the whole structure with this bug.

That concurrent queue is only there to illustrate usage of CAS in a data structure. I think having an actual implementation of a concurrent queue along with handling the ABA problem might be an entirely separate post.

I added in a note about the ABA problem but perhaps you're seeing a cached version of the post.

Ignoring the ABA problem, given the implementation of `new` and `delete` are blocking, so is the queue.
This is a great point. Technically you could replace your allocator with jemalloc or something similar, but most people probably don't.

Memory allocation gets forgotten in the "lock free" algorithms all the time, especially in java where allocation is forgotten about and brushed aside.

I think that the point rather was not to use any allocation in critical sections since allocator implementations are not lock-free or wait-free.

https://github.com/jemalloc/jemalloc/blob/dev/src/mutex.c

That was their point and that was my point.

Just because jemalloc has a mutex.c file, that doesn't mean that common paths aren't meant to be lock free and in the case of lots of little allocations that can go into small bucket sizes in jemalloc they should be.

It is still putting your head in the sand since at some point they have to go to the OS and map in memory which should lock and lots of small allocations are a terrible way to anything for performance, but it is possible to have some paths in an allocator not have locks.

Also if there are thread local heaps, those won't lock either.

You said

> Technically you could replace your allocator with jemalloc or something similar, but most people probably don't.

which suggests that the new/delete "blocking" nature can be solved just by replacing it with the jemalloc. That's nonsense because new/delete in itself is a plain stupid wrapper around the malloc/free. So, I still don't get the point of your commentary. It reads flawed and contradicting.

Nothing I said contradicts anything. I'm not sure where the disconnect is.

I didn't say replace new and delete with jemalloc, I said replace your allocator with jemalloc, which would mean new and delete end up calling that instead. This is a common and easy use of a different malloc implementation and is something I and many other people have done. jemalloc is also not the only allocator replacement and not the only one to focus on concurrency (tcmalloc and ptmalloc). There are also allocators like windows' built in thread local heaps. Some default malloc implementations now have some concurrency built in so they don't usually block.

What this comes down to is that new and delete don't have to block (in the common execution path) because the underlying allocator doesn't have to. This is a well worn problem and I have seen first hand parallel programs go from only using a single core while executing due to the default allocator blocking on every call to all cores being used with a new allocator. It is a problem created by too much allocation but the different implementations do deliver on their promise.

This is a separate issue from "lock free data structures" using memory allocation on every transaction which is a poor way to make any data structure and pushes a lot of the concurrency issues on to the memory allocator.

Hope that helps and that you learned something.