|
I think that's a generalization that simply shifts the burden elsewhere, and it cannot be called "the right" architecture in general. There is a reason CPUs implement cache coherence on top of their "innate" shared-nothing design, and the reason is abstraction. If you don't need certain abstractions, then a sharded approach is indeed optimal, but if you do, then you have to implement them at some level or another, and it's often better to rely on hardware messaging (cache coherence) than on software messaging. So for some systems, such as columnar and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency, etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear scalability as sharding with a small added latency constant (not a constant factor but a constant addition) while providing much richer abstractions. Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually managed memory. |
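To make the last point concrete, here is a minimal sketch of what keeping a shard's data out of the GC-managed heap can look like in Java. The class name `OffHeapCounters` and its methods are hypothetical, made up for illustration. It uses the standard `ByteBuffer.allocateDirect`, which is only the simplest option: the memory lives outside the Java heap, so the collector never scans or moves it, but it is still released when the buffer object itself is collected. Real systems that want fully explicit deallocation tend to use other mechanisms, which this sketch does not show.

```java
import java.nio.ByteBuffer;

// Hypothetical example: a fixed-size table of long counters stored off-heap.
// Intended to be owned by a single shard thread, so it is NOT thread-safe.
final class OffHeapCounters {
    private final ByteBuffer buf; // direct buffer: contents live outside the Java heap
    private final int slots;

    OffHeapCounters(int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        this.slots = slots;
        // Direct buffers are zero-initialized, so every counter starts at 0.
        this.buf = ByteBuffer.allocateDirect(Math.multiplyExact(slots, Long.BYTES));
    }

    // Adds delta to the counter in the given slot and returns the new value.
    long add(int slot, long delta) {
        int offset = offsetOf(slot);
        long updated = buf.getLong(offset) + delta;
        buf.putLong(offset, updated);
        return updated;
    }

    // Reads the current value of the counter in the given slot.
    long get(int slot) {
        return buf.getLong(offsetOf(slot));
    }

    private int offsetOf(int slot) {
        if (slot < 0 || slot >= slots) {
            throw new IndexOutOfBoundsException("slot " + slot);
        }
        return slot * Long.BYTES;
    }
}
```

The point is only that the counters are plain bytes at computed offsets rather than objects on the heap, so the amount of data in a shard does not add to GC work.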
And yes, you can write high-performance Java, but for whatever reason the Cassandra codebase isn't an example of that. They just did a big storage engine rewrite and the result is slower.
https://issues.apache.org/jira/browse/CASSANDRA-7486