| Over the last decade, the distributed system nature of modern server hardware internals has become painfully evident in how software architectures scale on a single machine. The traditional approaches -- multithreading, locking, lock-free structures, etc -- are all forms of coordination and agreement in a distributed system, with the attendant scalability problems if not used very carefully. At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in large distributed systems (e.g. shared nothing architectures) that you could realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox. The general model looks like this: - one process per core, each locked to a single core - use locked local RAM only (effectively limiting NUMA) - direct dedicated network queue (bypass kernel) - direct storage I/O (bypass kernel) If you do it right, you minimize the amount of silicon that is shared between processes which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too. As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system behavior reflects that model to the extent possible, is a great architecture for extreme performance but you rarely see it outside of closed source software. As a corollary, garbage-collected languages do not work for this at all. |
So for some abstractions such as column and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear-scalability as sharding with an added small latency constant (not a constant factor but a constant addition) while providing much richer abstractions.
Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.