Hacker News new | ask | show | jobs
by danbruc 4093 days ago
I just had a look at the definitions again. The presented log buffer implementation is neither wait-free, nor lock-free, nor obstruction-free because one failed writer will prevent the progress of any other thread unconditionally, in consequence this is accurately described as a blocking algorithm. To just quote from Wikipedia »[...] an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread [...]« which is obviously not the case because a failed writer will cause all subsequent writes to fail, i.e. the writers will happily fill the buffer but the messages never really make it into the buffer in a way that the reader can see them.

In case of the SPSC queue the requirement is that the writer must be able to enqueue new items no matter what the reader does and the reader must be able to dequeue all items for which the enqueue operation succeeded no matter what the writer does afterwards. The presented queue implementation meets this requirements. If the writer fails between steps 4 and 5 the write did not succeed and it does not matter that the reader can not see the item.

What you say about keeping the data structure usable for other threads seems correct to me.

You must not confuse lock-free and not using a lock, these are two different things. An algorithm using locks is a blocking algorithm but using locks is only sufficient, not mandatory. If no process can block other processes unconditionally the algorithm is non-blocking and obstruction-free, lock-free and wait-free are different progress guarantees for non-blocking algorithms in ascending order of strength.

One last point, if you try to obtain a lock and fail to do so you are considered blocked, no matter if you spin, suspend the thread or do unrelated things to waste some time until retrying later.

1 comments

When I re-read the definitions I can see what you take from it. I think it comes down to what you consider as systemic progress. With respect to the preciseness of the definitions I have likely misinterpreted this. Is the system the algorithm itself or is it the system it lives in? I assumed the latter which may well be a mistake.

Each thread under the algorithm can perform their actions in a finite number of steps without ever blocking. This means the producers can continue to do other work. The consumer can continue to consume from other log buffers without being blocked and complete in a finite number of steps. If a producer is killed mid operation then no further progress can be made on that log buffer. If this is considered blocking then the algorithm is blocking and therefore not wait-free. It would need to be killed by another malicious thread for this to happen.

What is clear is that this algorithm gives the best latency profile of of all the measured messaging systems and the highest throughput. I now have the challenge of searching for a name that best describes its behaviour.

Glad to see that we reached consensus. As mentioned before, wait-freedom has nice progress guarantees but does not necessarily provide the best possible latency or throughput because of the overhead associated with guaranteeing progress for all live threads. I also finally realized that you are the speaker, didn't make the connection before.

And something I wanted to mention before but forgot to do - there is not only a problem if the responsible writer fails while trying to rotate the buffers but also if there is just another writer trying to write before the the buffer rotation completed. There is really not much you can do in this case besides retrying until you succeed. But this again also means that a writer may have bad luck and every time he looks a buffer rotation - not necessarily the same one - is in progress causing the writer to starve.

While one thread is in the action of rotation other producers return right away from the offer. The possibility of starvation occurs if the same thread each time it retried, that buffer has advanced to the next rotation. If adding 100 byte messages to a 128 MB buffer that is greater than a 1 in a million chance on each rotation. To have this continue then the probabilities have to be multiplied for the number of times you expect it to happen. So for ultimate starvation that gets crazy very very quickly ;-) Do you see it as more likely than that?
Not really, all these things are pretty unlikely. But if you deploy enough installations and run them with enough load you are pretty much guaranteed to see all the issues, even if only rarely. There will be someone simply killing no longer needed threads rendering a log buffer unusable from time to time. Someone will from time to time send relatively huge messages making the starvation scenario more likely. The thread won't of course starve forever but just hitting the rotation twice will make for a pretty heavy outlier in the latency chart.

And just in case I wasn't clear about that, I am looking at all of this from a pretty theoretical point of view - I have at best a very rough idea what this will be used for in the real world and what scenarios are important and likely and what just doesn't matter. Actually maybe even the thing I just wrote is nonsense because nobody in your target audience will simply kill threads, but then again developers are notoriously good at bending and ignoring rules.

Thanks for the feedback. I'll adjust my terminology to be more precise as it is the right thing to do. Doing things in the open is a great way to learn. I'm not shy of airing my laundry :-)