The part of this to pay the most attention to is way that it handles node failure:
> The traditional way to avoid blocking on coordinator crashes is to use three-phase commit, but we want to avoid the extra phase. We accomplish this by blocking on participant crashes instead of coordinator crashes.
That seems sensible, and is later justified by the fact that the memory nodes can be replicated, and can be highly available in their own right. This brings a significant limitation:
> When a participant memory node crashes, the system blocks any outstanding minitransactions involving the participant until the participant recovers.
So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built a system with many single points of failure instead of just one.
Sinfonia is actually a pretty clever system, and despite some interesting edges (crash recovery and log compaction), is not particularly complicated. Directly comparing its memory model to multi-Paxos, though, rings a little bit hollow for me. One of these things is a non-blocking atomic commit protocol which allows transactions across reliable nodes, the other is a consensus protocol which replicates a log across unreliable nodes. They aren't really solving the same problem.
> So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built a system with many single points of failure instead of just one.
Yes, in Sinfonia that's true since an item resides on only one node. The Scalaris project augments Sinfonia by performing operations on a majority of replicas.
> the other is a consensus protocol which replicates a log across unreliable nodes
Interesting. I hadn't thought of it like that. Indeed, when you frame it that way, it doesn't seem right to compare a minitransaction to multi-Paxos. I had thought of multi-Paxos as a mechanism to serialize updates on the replicas of a key-value store. In that framing, the comparison makes more sense.
The NameNode service supports high availability out of the box, and uses a QJM (Quorum Journal Manager) to share edits between the active/standby node. To say the NameNode is a SPOF is not only misleading, it's incorrect.
> the NameNode is also a scaling bottleneck. Services which build on Hadoop
What's indicating that the NameNode is a scaling bottleneck? There are production Hadoop clusters with PBs of data, and the NameNode has not been an issue for scalability. Just so we're clear, data written to HDFS does not flow through the NameNode.
> As a result, many enterprises that use Hadoop and HBase do so only in non-critical services, such as supporting the business analytics group
This is the point where I point to the hundreds of companies which rely on Hadoop to run mission critical applications (Facebook, Yahoo, Ebay, Box, ..). Just google around...look at customers of Cloudera, Hortonworks, Pivotal, etc.
That's right, it should be noted however that the need of horizontally scaling the equivalent of the NameNode kicks in only when you really have a very large storage system, where large can be defined as:
"You know you have a large storage
system when you get paged at 1 AM because you only have a
few petabytes of storage left." [1]
Even if you are below that size you might have a large system, and even if using a single metadata master node might be a sound solution you still have lot of interesting problems to solve. Don't make systems more complex than necessary.
More pointers to google public material on colossus in [2].
Very few cluster sizes in enterprises reach this limit, and by limit I mean 1k -> 2k nodes. In reality, there's very little demand for namenode scalability at the current moment, but once this changes, the community will implement it.
> Just so we're clear, data written to HDFS does not flow through the NameNode.
I wrote that poorly. You're right, file blocks do not flow through the NameNode, only metadata updates do. The issues occur when the clients' metadata operations saturate the node.
> This is the point where I point to the hundreds of companies
Indeed. Many companies have had great success with Hadoop. It's one of the reasons Hadoop sees such broad use. However, other users' mileage has varied.
Ah, I see how you could interpret the diagram that way. The client and server are clearly different machines. However, the replicas and acceptors can be on the same machine, in fact in the same process. And the server can be on the same machine as a replica or acceptor. So only three machines are necessary.
Having Jepsen tests would be a good thing. The community trusts them, and it's wise to have tests from someone other than the author of the main-line code. If I'm so lucky as to find funding--my savings will only go so far--then I or someone I hire will eventually apply Jepsen to Treode. In the meantime, there are already tests a bit like Jepsen in the repository. They simulate dropping and reordering messages, and crashing and recovering nodes. They have proven their value as they have flushed out lots of early bugs.
My immediate concerns are finding help and finding a sponsor (investor or adventurous customer). To that end, I'm less coding right now and more promoting. Getting the word out will help me find developers interested in improving or trying it.
When I do get to code, I may focus on load and performance testing. I have not done enough work along those lines yet. Or, I may focus on an example that sketches how the read/write side of the API integrates easily with Etags, CDNs and web-clients, and how the scan API integrates with Spark or Map-Reduce.
The part of this to pay the most attention to is way that it handles node failure:
> The traditional way to avoid blocking on coordinator crashes is to use three-phase commit, but we want to avoid the extra phase. We accomplish this by blocking on participant crashes instead of coordinator crashes.
That seems sensible, and is later justified by the fact that the memory nodes can be replicated, and can be highly available in their own right. This brings a significant limitation:
> When a participant memory node crashes, the system blocks any outstanding minitransactions involving the participant until the participant recovers.
So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built a system with many single points of failure instead of just one.
Sinfonia is actually a pretty clever system, and despite some interesting edges (crash recovery and log compaction), is not particularly complicated. Directly comparing its memory model to multi-Paxos, though, rings a little bit hollow for me. One of these things is a non-blocking atomic commit protocol which allows transactions across reliable nodes, the other is a consensus protocol which replicates a log across unreliable nodes. They aren't really solving the same problem.