Aeron: High-Performance Open Source Message Transport [video]

Y	Hacker News new \| ask \| show \| jobs

	Aeron: High-Performance Open Source Message Transport [video] (infoq.com)
	38 points by enginous 4140 days ago

2 comments

danbruc 4140 days ago

The described log buffer implementation does not seem wait-free to me. If one of the writers fails all other subsequent writes will fail, too. Subsequent writers can still advance the tail pointer and copy their messages into the buffer but the reader will never see their messages because the failed writer did not update the length field in its header.

So physically the operations for the subsequent writers completed after updating the length fields in their headers but logically they are not completed until they become visible to the reader and that requires updating the length field in the header of the failed writer. The situation seems even worse when the writer responsible to rotate the logs fails.

I did not look into the source code, they are probably dealing with failed writers, but the presentation alone does not provide any hints to me how they do it.

link

mjpt777 4140 days ago

Can you expand on how you see the writer can "fail"? Also please relate this to other algorithms you see as wait-free but don't have your failure issue. The code is very simple and does not have external dependencies. It is a three step process of advance tail, copy in message, apply header. All within the implementation, all threads make progress and completes in a finite number of steps, and thus wait-free.

https://github.com/real-logic/Aeron/blob/master/aeron-client...

link

danbruc 4140 days ago

I will kill or at least pause the thread after it advanced the tail pointer but before it updated the length field in the header. In other wait-free algorithms all the other threads will usually help failed threads to complete their work.

link

mjpt777 4140 days ago

Can you point to a respected publication that states that is the definition of a wait-free algorithm, i.e. that other threads must help and it must cope with killing the thread externally?

By kill a thread I assume you mean use the OS to interrupt and terminate the thread from the outside while in such a simple algorithm and yet expect it to cope. Have you tried this on any java.util.concurrent classes and reported the "bugs" you have found? I'd be interested in the feedback you got. :-)

link

danbruc 4140 days ago

This is the very definition of wait-free including the case that some concurrent operations never complete.

»A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes.« (Wait-Free Synchronization, Maurice Herlihy, 1991) [1]

As far as I can tell - I am definitely not an expert - there is no hard requirement that concurrent operations have to assist but it seems at least a common solution because undoing the partially completed work of other concurrent operations may have the consequence that they can not make progress and therefore the entire thing is no longer wait-free.

By the way, there is no requirement to explicitly kill the thread, you may as well just run out of stack space, the scheduler may decide to not assign a new time slice, cosmic rays may flip a bit in turn triggering a hardware exception because of an unknown instruction or just buggy or malicious code messing with the content of your address space. Every process can fail at any time.

[1] http://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

link

mjpt777 4140 days ago

Thanks for the link.

If you look at the implementation no thread is ever stopped from completing. If a producing thread was killed mid operation by another malicious thread then the stream is broken. Other threads are never blocked. Other writers can add via a finite number of steps and the stream will back pressure when flow control kicks in. The consumer is never blocked - it simply never sees more messages. The algorithm meets the definition.

Unless I missed it that paper does not not state that the algorithm must cope with a malicious thread to be wait-free.

Try applying your thinking to this other respected MPSC wait-free queue.

http://www.1024cores.net/home/lock-free-algorithms/queues/in...

link

mjpt777 4140 days ago

I understand the point that we need to make software robust. No single algorithm can cope with every possible failure mode. That is why we need multiple instances run on multiple nodes to be resilient. This is independent from if this is a wait-free algorithm.

link

anton_gogolev 4140 days ago

Half-expected something related to Herman Miller's Aeron.

http://www.hermanmiller.com/products/seating/performance-wor...

link