Hacker News new | ask | show | jobs
by ncmncm 2466 days ago
Well, OK. All this is about high-throughput, low-latency systems.

The principle is decouple, decouple, decouple. Memory isn't just memory, it's paged and mapped, and the mappings are in a small cache called the TLB, one for each core. Each "hugetlb" page, 2MB or 1GB on x86, takes just one such cache entry, so anything big, like buffers, should live in hugepages.

A ring buffer is a kind of queue with just a head, and one writer. Each new item goes at the next place in the buffer, round-robin. A head pointer -- if it's in shared memory, an index -- gets updated "atomically" to point to the newest item. Downstream readers poll for updates to the head. New stuff overwrites the oldest stuff, so downstream readers can look until it gets overwritten, and can often avoid copying. They don't need to lock anything, but need to check that the head hasn't swept in and and overwritten what they were looking at; that is called being lapped. It is their responsibility to keep up, and prevent this.

Because there is never any question where the next entry goes, hardware devices understand ring buffers, and can be set to write to them whenever there is data. Typically a proprietary library talks to a proprietary driver to set this up, and then the hardware device runs free with no more interaction. (io_uring, AF_XDP, libexanic, ef_vi, DPDK, PF_RING, netmap, etc.)

Usually the hardware ring buffer is pretty small, a few MB, so for high-rate flows there might be cores dedicated to copying from it to one or more much, much bigger ring buffers in shared, mapped memory. Typically, multiple downstream readers watch for interesting traffic to show up on such a ring, splitting the work out to multiple cores.

Threads famously interfere with one another, mainly when competing for locks; but also, whenever they fool with the memory map, other threads may experience TLB cache stalls. Separate programs are better isolated, and can be further isolated by running on a dedicated core ("isolcpu", "NOHZ", and "taskset") that is protected against the OS sticking other threads on it, or vectoring interrupts to it. In extreme cases a core may offload its own RCU retirements, or even not run any kernel code.

A unikernel may run on such a core, running a single program, so what it thinks are system calls just call a static library. There is a lot of work going on on variations on this theme -- exokernels, parakernels, etc.

Instead of getting the file system and buffer cache all mixed up in your program, you can append to files with O_DIRECT writes, or store to mapped memory and let the kernel expose it to other processes, and spool to disk, asynchronously. A monitoring process can look at event counters in such memory as they are updated in real time. It is generally better if the program updating the counters also stores a generic description of them -- type, name, a hierarchical structure that can be read out to a JSON record, periodically, by a separate program. That might be written to a log and/or feed a status dashboard. Thus, the code doing the work just updates memory words pointed to from its working configuration, but doesn't ever need to format or write out updates. If there is any actual text logging, it goes through another ring buffer to a background logging process that, ideally, is also responsible for formatting.

Memory management -- new and delete -- is a source of unpredictable delays. Such allocations are always OK during startup, but often not after. A function that needs memory, then, should use memory provided by its caller. The top level can handle memory deterministically, pre-allocated or on the stack, with a global view of program behavior.

Using separate processes enables starting and stopping downstream processing independently, and isolates crashes. Ring buffers being read are always mapped read-only, so a crashed reader cannot corrupt any shared state.

2 comments

This is great! These are some concrete and non-trivial architectural techniques for "Systems Programming" :-)

I have had opportunity to work on/with some of these techniques on Fast Network Protocol/Security Appliances and so have some familiarity with them. However some of your hints(breadcrumbs?) are not known to me and hence i have something to research and study. Thank you.

PS: Can you add some more details on the above techniques? Like System/Library/API calls to look into, books/papers/articles to read etc?

Another hint.

Speaking of breadcrumbs, if the ring buffer has fixed-size entries, a reader can come in later and start reading old entries first, say halfway back. This is helpful if you want to start a new reader and then kill an old one, and not skip any entries.

It helps if the ring has power-of-two size, and the head pointer/index is 64 bits and increases monotonically. Then the high bits are easily masked off on each use, so that arithmetic on pairs of positions is simpler.

For variable-sized entries, an array of N "breadcrumbs", past positions near 1/Nth indices, allows jumping in at earlier positions. If traffic is low enough, you might be able to buffer a whole day's traffic, and get random access starting from breadcrumbs; otherwise, you can log old entries to a sequential file, and also log the breadcrumbs, translated to file offsets, as a global index.

Downstream processes can each sequentially log an individual field of each record, with a breadcrumb index to enable full records to be reconstructed. Often these column logs can be compressed, with enormous efficiency, between breadcrumbs: 98% compression may be easy to achieve for slowly-changing or limited-alphabet values.

Lz4 and Zstd are excellent compression engines. Lz4 really shines for fast decompression. There is no excuse for zlib/gz compression anymore.

Thanks for the hints. For one product that i had worked on, we had something like the above for runtime logging/debugging. A shared memory area (i.e. address range in Linux process space where shared libraries are loaded) at a fixed address was reserved via linker scripts with each process having its ring buffer at its own fixed offset. A separate reader process interacted with the CLI to provide comprehensive access to this data. It was all robust and worked quite well.

Unfortunately, these sorts of practical techniques are not known to many programmers and it would be nice if somebody (eg. you :-) were to list it on a website/book with some sample code for everybody's benefit.

This is all fascinating. Do you have any recommended reading on designing said high-throughput, low-latency systems?
You might find the following useful.

* Network Algorithmics,: An Interdisciplinary Approach to Designing Fast Networked Devices - https://www.amazon.com/Network-Algorithmics-Interdisciplinar...

* See MIPS Run - https://www.amazon.com/Morgan-Kaufmann-Computer-Architecture...

* UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers - https://www.amazon.com/UNIX-Systems-Modern-Architectures-Mul...

* Advanced UNIX Programming - https://www.amazon.com/Advanced-UNIX-Programming-Marc-Rochki...

I don't know of any. Maybe the people who know this are too busy building them. I might be in trouble for writing as much as I did. :-)