The Minitransaction: An Alternative to Multi-Paxos and Raft

Y	Hacker News new \| ask \| show \| jobs

	The Minitransaction: An Alternative to Multi-Paxos and Raft (blog.treode.com)
	58 points by topher-the-geek 4367 days ago

4 comments

mjb 4367 days ago

The Sinfonia paper (http://www.sosp2007.org/papers/sosp064-aguilera.pdf) is interesting, and section 4 does a better job of explaining how this 'minitransaction' model works than this blog post does.

The part of this to pay the most attention to is way that it handles node failure:

> The traditional way to avoid blocking on coordinator crashes is to use three-phase commit, but we want to avoid the extra phase. We accomplish this by blocking on participant crashes instead of coordinator crashes.

That seems sensible, and is later justified by the fact that the memory nodes can be replicated, and can be highly available in their own right. This brings a significant limitation:

> When a participant memory node crashes, the system blocks any outstanding minitransactions involving the participant until the participant recovers.

So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built a system with many single points of failure instead of just one.

Sinfonia is actually a pretty clever system, and despite some interesting edges (crash recovery and log compaction), is not particularly complicated. Directly comparing its memory model to multi-Paxos, though, rings a little bit hollow for me. One of these things is a non-blocking atomic commit protocol which allows transactions across reliable nodes, the other is a consensus protocol which replicates a log across unreliable nodes. They aren't really solving the same problem.

link

topher-the-geek 4366 days ago

> So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built a system with many single points of failure instead of just one.

Yes, in Sinfonia that's true since an item resides on only one node. The Scalaris project augments Sinfonia by performing operations on a majority of replicas.

> the other is a consensus protocol which replicates a log across unreliable nodes

Interesting. I hadn't thought of it like that. Indeed, when you frame it that way, it doesn't seem right to compare a minitransaction to multi-Paxos. I had thought of multi-Paxos as a mechanism to serialize updates on the replicas of a key-value store. In that framing, the comparison makes more sense.

link

monstrado 4367 days ago

> The NameNode is a single point of failure.

The NameNode service supports high availability out of the box, and uses a QJM (Quorum Journal Manager) to share edits between the active/standby node. To say the NameNode is a SPOF is not only misleading, it's incorrect.

> the NameNode is also a scaling bottleneck. Services which build on Hadoop

What's indicating that the NameNode is a scaling bottleneck? There are production Hadoop clusters with PBs of data, and the NameNode has not been an issue for scalability. Just so we're clear, data written to HDFS does not flow through the NameNode.

> As a result, many enterprises that use Hadoop and HBase do so only in non-critical services, such as supporting the business analytics group

This is the point where I point to the hundreds of companies which rely on Hadoop to run mission critical applications (Facebook, Yahoo, Ebay, Box, ..). Just google around...look at customers of Cloudera, Hortonworks, Pivotal, etc.

link

spullara 4367 days ago

The name node is a significant vertical scaling challenge. It is one of the reasons that Yahoo limits their cluster sizes.

link

ithkuil 4367 days ago

That's right, it should be noted however that the need of horizontally scaling the equivalent of the NameNode kicks in only when you really have a very large storage system, where large can be defined as:

"You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." [1]

Even if you are below that size you might have a large system, and even if using a single metadata master node might be a sound solution you still have lot of interesting problems to solve. Don't make systems more complex than necessary.

More pointers to google public material on colossus in [2].

[1] http://static.googleusercontent.com/media/research.google.co...

[2] http://www.highlyscalablesystems.com/3202/colossus-successor...

link

monstrado 4367 days ago

Very few cluster sizes in enterprises reach this limit, and by limit I mean 1k -> 2k nodes. In reality, there's very little demand for namenode scalability at the current moment, but once this changes, the community will implement it.

link

topher-the-geek 4366 days ago

> Just so we're clear, data written to HDFS does not flow through the NameNode.

I wrote that poorly. You're right, file blocks do not flow through the NameNode, only metadata updates do. The issues occur when the clients' metadata operations saturate the node.

> This is the point where I point to the hundreds of companies

Indeed. Many companies have had great success with Hadoop. It's one of the reasons Hadoop sees such broad use. However, other users' mileage has varied.

link

batbomb 4367 days ago

So the acceptor is a replica-read-only machine and a client-propose-commit-only machine.

Also, the minimum production setup for this cluster is 6+1 machines?

link

topher-the-geek 4366 days ago

Ah, I see how you could interpret the diagram that way. The client and server are clearly different machines. However, the replicas and acceptors can be on the same machine, in fact in the same process. And the server can be on the same machine as a replica or acceptor. So only three machines are necessary.

link

fasteo 4366 days ago

In Aphyr's "Call me maybe" we trust

link

glibgil 4366 days ago

topher-the-geek, will you be writing Jepsen tests for Treode?

link

topher-the-geek 4365 days ago

Having Jepsen tests would be a good thing. The community trusts them, and it's wise to have tests from someone other than the author of the main-line code. If I'm so lucky as to find funding--my savings will only go so far--then I or someone I hire will eventually apply Jepsen to Treode. In the meantime, there are already tests a bit like Jepsen in the repository. They simulate dropping and reordering messages, and crashing and recovering nodes. They have proven their value as they have flushed out lots of early bugs.

My immediate concerns are finding help and finding a sponsor (investor or adventurous customer). To that end, I'm less coding right now and more promoting. Getting the word out will help me find developers interested in improving or trying it.

When I do get to code, I may focus on load and performance testing. I have not done enough work along those lines yet. Or, I may focus on an example that sketches how the read/write side of the API integrates easily with Etags, CDNs and web-clients, and how the scan API integrates with Spark or Map-Reduce.

link

fasteo 4365 days ago

Maybe you can try to contact aphyr. Some of his "call me maybe" articles have been funded (by Comcast if I remember well)

link