Hacker News new | ask | show | jobs
by mjpt777 4094 days ago
Thanks for the link.

If you look at the implementation no thread is ever stopped from completing. If a producing thread was killed mid operation by another malicious thread then the stream is broken. Other threads are never blocked. Other writers can add via a finite number of steps and the stream will back pressure when flow control kicks in. The consumer is never blocked - it simply never sees more messages. The algorithm meets the definition.

Unless I missed it that paper does not not state that the algorithm must cope with a malicious thread to be wait-free.

Try applying your thinking to this other respected MPSC wait-free queue.

http://www.1024cores.net/home/lock-free-algorithms/queues/in...

3 comments

Regarding applying the wait/lock/obstruction freedom definition to Vyukov's MPSC implementation you reference.

It's clear that the MPSC design is not wait-free, nor is it lock-free. Dmitry identifies this fact in his initial postings about that design: https://groups.google.com/forum/#!topic/lock-free/Vd9xuHrLgg...

""" Push function is blocking wrt consumer. I.e. if producer blocked in (*), then consumer is blocked too. """

Between the XCHG and the update of next. If the producing thread completes the XCHG but fails to update next, all threads will fail to make progress. "Progress" in this case being the producers transferring data to the consumer. Put another way, in this algorithm a single misbehaving thread can prevent all threads from making progress.

Regarding the idea of "maliciousness", specifically if a participant thread is "killed mid-operation". That idea is the very definition of wait-freedom: that all participants complete the algorithm in a bounded number of their own steps, irrespective of the activities of other threads.

The definitions of wait/lock/obstruction freedom are well specified. I suggest the first half of _The Art of Multiprocessor Programming_ (the revised edition!) by Herlihy and Shavit for a deep dive.

I've read the _The Art of Multiprocessor Programming_ and am happy to read it again. Maybe I have misinterpreted.

So if "killed mid-operation" must be supported then I don't see how many algorithms can be said to make progress. Take for example the Lamport SPSC queue[1], if the producer gets killed mid operation between steps 4 and 5 of the push operation. Then the data is in the queue but the consumer is blocked from ever seeing it with this line of reasoning. The Lamport SPSC queue is considered wait-free by the concurrency community I know. I base my reasoning on this. What if the producer takes a pause for a long time between steps 4 and 5 before continuing?

However if to be wait-free an algorithm must allow all other threads to continue using the data structure, not just continue making progress in other ways by being non-blocking and completing in a finite number of steps, then I stand corrected.

If wait-free must include coping with any thread being killed mid operation is there a term for being lock-free but also having all threads not block and complete in a finite number of steps for their interaction with the algorithm?

[1] - http://arxiv.org/pdf/1012.1824.pdf

I just had a look at the definitions again. The presented log buffer implementation is neither wait-free, nor lock-free, nor obstruction-free because one failed writer will prevent the progress of any other thread unconditionally, in consequence this is accurately described as a blocking algorithm. To just quote from Wikipedia »[...] an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread [...]« which is obviously not the case because a failed writer will cause all subsequent writes to fail, i.e. the writers will happily fill the buffer but the messages never really make it into the buffer in a way that the reader can see them.

In case of the SPSC queue the requirement is that the writer must be able to enqueue new items no matter what the reader does and the reader must be able to dequeue all items for which the enqueue operation succeeded no matter what the writer does afterwards. The presented queue implementation meets this requirements. If the writer fails between steps 4 and 5 the write did not succeed and it does not matter that the reader can not see the item.

What you say about keeping the data structure usable for other threads seems correct to me.

You must not confuse lock-free and not using a lock, these are two different things. An algorithm using locks is a blocking algorithm but using locks is only sufficient, not mandatory. If no process can block other processes unconditionally the algorithm is non-blocking and obstruction-free, lock-free and wait-free are different progress guarantees for non-blocking algorithms in ascending order of strength.

One last point, if you try to obtain a lock and fail to do so you are considered blocked, no matter if you spin, suspend the thread or do unrelated things to waste some time until retrying later.

When I re-read the definitions I can see what you take from it. I think it comes down to what you consider as systemic progress. With respect to the preciseness of the definitions I have likely misinterpreted this. Is the system the algorithm itself or is it the system it lives in? I assumed the latter which may well be a mistake.

Each thread under the algorithm can perform their actions in a finite number of steps without ever blocking. This means the producers can continue to do other work. The consumer can continue to consume from other log buffers without being blocked and complete in a finite number of steps. If a producer is killed mid operation then no further progress can be made on that log buffer. If this is considered blocking then the algorithm is blocking and therefore not wait-free. It would need to be killed by another malicious thread for this to happen.

What is clear is that this algorithm gives the best latency profile of of all the measured messaging systems and the highest throughput. I now have the challenge of searching for a name that best describes its behaviour.

Glad to see that we reached consensus. As mentioned before, wait-freedom has nice progress guarantees but does not necessarily provide the best possible latency or throughput because of the overhead associated with guaranteeing progress for all live threads. I also finally realized that you are the speaker, didn't make the connection before.

And something I wanted to mention before but forgot to do - there is not only a problem if the responsible writer fails while trying to rotate the buffers but also if there is just another writer trying to write before the the buffer rotation completed. There is really not much you can do in this case besides retrying until you succeed. But this again also means that a writer may have bad luck and every time he looks a buffer rotation - not necessarily the same one - is in progress causing the writer to starve.

While one thread is in the action of rotation other producers return right away from the offer. The possibility of starvation occurs if the same thread each time it retried, that buffer has advanced to the next rotation. If adding 100 byte messages to a 128 MB buffer that is greater than a 1 in a million chance on each rotation. To have this continue then the probabilities have to be multiplied for the number of times you expect it to happen. So for ultimate starvation that gets crazy very very quickly ;-) Do you see it as more likely than that?
Two threads want to send a single block message. Writer one advances the tail pointer by 100 bytes and then dies. Writer two then advances the tail pointer by another 100 bytes, copies its message into the buffer and writes its header. The reader will see the tail pointer 200 bytes past the last processed message but no valid header and this looks exactly like a writer currently in the process of placing a 200 byte message - including the header - into the buffer. Looks to me like thread one blocked thread two indefinitely because the message of thread two is never added to the buffer. Yes, the bytes are in the buffer, but only physically. The reader can not see them, i.e. the write operation did not logically complete.
If you look at the code the reader does not read the tail. It skips forward reading the length fields. It would never see the writer two message and it never blocks. It just thinks there are no new messages.
Writer two is blocked, not the reader. Writer two wants to add a message to the buffer but never succeeds because writer one fails to make the message of writer two visible to the reader.
Writer two is not blocked, it is not waiting, it is not being obstructed. It is wait-free. It returns and can do whatever work it wishes.

However the data structure would be corrupt if writer one never completed. This is corruption and not the definition of wait-free.

No, it is not wait-free. The operation of writer two does not even succeed. Some bytes get written to the buffer but the message is never really added to the log because to call a write a success it should at least become visible to the reader. Writer two returns from the write operation without noticing that the operation failed. To make sure that the write succeeded writer two would have to wait and verify that its message actually becomes visible to the reader.

You can even go a step further and make writer two block. Just assume that the next thing it does is waiting for a response from the receiver of its message. Now because writer one failed the message never reaches the receiver, the receiver never sends a response and the writer waits indefinitely for the response, i.e. is blocked. Now the failed writer one is really and the only reason writer two is blocked, even though only indirectly.

And you can not really call it a corruption of the data structure because it is the exact same stated that occurs during normal operation. If writer one would resume execution after two weeks suddenly the corruption would be gone.

What if, instead of being killed, the thread runs arbitrarily slowly (at glacial speeds) between the two critical steps? Do other threads have to wait during that time, (Note: I haven't slept in days, not did I read the algorithm as my brain isn't working well enough to understand it, but I did seem to get the gist of this back/forth, and this seemed like an important detail.)
Please read the algorithm in code. No threads ever wait. :-)