Have you given any more thought to ... not multithreading it? Since you're scaling across servers, apply the same concept across cores. Presto, no more bottleneck on atomically incrementing an ID.
Good thinking, but I think that shifts the issue--namely, that each inter-thread message uses atomic compare and swap to create the message. I assume there'd be a similar bottleneck on the actor that generates the transactionid limited by the number of messages it can send & receive.
Instead, a friend and I have been thinking about how to perhaps modify MVCC to work with distinct transactionid's per partition. Namely, I'm already generating what I call "subtransactionid"'s for each partition involved in a transaction. And those must be ordered for synchronous replication, so I think the way to implement a variation on MVCC may already be mostly there.
I know I still owe you an architectural doc...fixin' ta, ya know.