| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jandrewrogers 3929 days ago

Over the last decade, the distributed system nature of modern server hardware internals has become painfully evident in how software architectures scale on a single machine. The traditional approaches -- multithreading, locking, lock-free structures, etc -- are all forms of coordination and agreement in a distributed system, with the attendant scalability problems if not used very carefully.

At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in large distributed systems (e.g. shared nothing architectures) that you could realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox.

The general model looks like this:

- one process per core, each locked to a single core

- use locked local RAM only (effectively limiting NUMA)

- direct dedicated network queue (bypass kernel)

- direct storage I/O (bypass kernel)

If you do it right, you minimize the amount of silicon that is shared between processes which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.

As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system behavior reflects that model to the extent possible, is a great architecture for extreme performance but you rarely see it outside of closed source software.

As a corollary, garbage-collected languages do not work for this at all.

7 comments

pron 3928 days ago

I think that's a generalization that simply shifts the burden elsewhere, and cannot be said to be "the right" architecture in general. There is a reason CPUs implement cache-coherence on top of their "innate" shared-nothing design, and the reason is abstraction. If you don't need certain abstractions, then a sharded approach is indeed optimal, but if you do, then you have to implement them at some level or another, and it's often better to rely on hardware messaging (cache-coherence) than software messaging.

So for some abstractions such as column and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear-scalability as sharding with an added small latency constant (not a constant factor but a constant addition) while providing much richer abstractions.

Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.