|
Well, OK. All this is about high-throughput, low-latency systems. The principle is decouple, decouple, decouple.

Memory isn't just memory: it's paged and mapped, and the mappings are held in a small cache called the TLB, one for each core. Each "hugetlb" page, 2MB or 1GB on x86, takes just one TLB entry, so anything big, like buffers, should live in hugepages.

A ring buffer is a kind of queue with just a head, and one writer. Each new item goes at the next place in the buffer, round-robin. A head pointer -- if it's in shared memory, an index -- gets updated "atomically" to point to the newest item. Downstream readers poll for updates to the head. New stuff overwrites the oldest stuff, so downstream readers can look at an item until it gets overwritten, and can often avoid copying. They don't need to lock anything, but they do need to check that the head hasn't swept in and overwritten what they were looking at; that is called being lapped. It is their responsibility to keep up and prevent this.

Because there is never any question where the next entry goes, hardware devices understand ring buffers, and can be set to write to them whenever there is data. Typically a proprietary library talks to a proprietary driver to set this up, and then the hardware device runs free with no more interaction. (See io_uring, AF_XDP, libexanic, ef_vi, DPDK, PF_RING, netmap, etc.) Usually the hardware ring buffer is pretty small, a few MB, so for high-rate flows there might be cores dedicated to copying from it to one or more much, much bigger ring buffers in shared, mapped memory. Typically, multiple downstream readers watch for interesting traffic to show up on such a ring, splitting the work out to multiple cores.

Threads famously interfere with one another, mainly when competing for locks; but also, whenever they fool with the memory map, other threads may experience TLB stalls.
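To make the single-writer ring and the "lapped" check concrete, here is a minimal C++ sketch. Everything in it is an assumption for illustration: the names `Ring`, `publish`, `read` and `ReadStatus` are made up, each item is a single 64-bit word, and the ring is an ordinary object rather than something living in shared, mapped memory. A real system would carry bigger records and tune the details.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is a sketch, not any library's API.
enum class ReadStatus { Ok, Empty, Lapped };

template <std::size_t N>
class Ring {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
    std::atomic<std::uint64_t> slots_[N] = {};
    std::atomic<std::uint64_t> head_{0};  // number of items published so far

public:
    // Writer side: exactly one thread (or process) may call this.
    void publish(std::uint64_t value) {
        const std::uint64_t h = head_.load(std::memory_order_relaxed);
        // Order the previous head update before this overwrite, so a reader
        // that observes the new slot contents also observes head >= h.
        std::atomic_thread_fence(std::memory_order_release);
        slots_[h & (N - 1)].store(value, std::memory_order_relaxed);
        head_.store(h + 1, std::memory_order_release);  // make item h visible
    }

    // Reader side: try to read item number `pos` (0-based), without locking.
    ReadStatus read(std::uint64_t pos, std::uint64_t& out) const {
        const std::uint64_t h = head_.load(std::memory_order_acquire);
        if (pos >= h) return ReadStatus::Empty;       // not published yet
        if (h - pos >= N) return ReadStatus::Lapped;  // already overwritten, or about to be
        assert(pos < h && h - pos < N);               // slot is inside the live window
        const std::uint64_t v =
            slots_[pos & (N - 1)].load(std::memory_order_relaxed);
        // Re-check the head after reading: if the writer has come around to
        // this slot in the meantime, the value read cannot be trusted.
        std::atomic_thread_fence(std::memory_order_acquire);
        const std::uint64_t h2 = head_.load(std::memory_order_relaxed);
        if (h2 - pos >= N) return ReadStatus::Lapped;
        out = v;
        return ReadStatus::Ok;
    }
};
```

A reader keeps its own position, calls `read(pos, v)`, and advances on `Ok`. This sketch treats a reader that is exactly `N` items behind as lapped, because the writer may be in the middle of overwriting that slot, so only `N - 1` items are ever safely readable. On `Lapped` the reader has to skip ahead and accept the loss.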
Separate programs are better isolated, and can be further isolated by running on a dedicated core ("isolcpus", "nohz_full", and "taskset") that is protected against the OS sticking other threads on it, or vectoring interrupts to it. In extreme cases a core may offload its own RCU retirements, or even not run any kernel code. A unikernel may run on such a core, running a single program, so what it thinks are system calls just call a static library. There is a lot of ongoing work on variations of this theme -- exokernels, parakernels, etc.

Instead of getting the file system and buffer cache all mixed up in your program, you can append to files with O_DIRECT writes, or store to mapped memory and let the kernel expose it to other processes and spool it to disk asynchronously.

A monitoring process can look at event counters in such memory as they are updated in real time. It is generally better if the program updating the counters also stores a generic description of them -- type, name, a hierarchical structure -- that a separate program can periodically read out to a JSON record. That might be written to a log and/or feed a status dashboard. Thus, the code doing the work just updates memory words pointed to from its working configuration, but doesn't ever need to format or write out updates. If there is any actual text logging, it goes through another ring buffer to a background logging process that, ideally, is also responsible for formatting.

Memory management -- new and delete -- is a source of unpredictable delays. Such allocations are always OK during startup, but often not after. A function that needs memory, then, should use memory provided by its caller. The top level can handle memory deterministically, pre-allocated or on the stack, with a global view of program behavior.

Using separate processes enables starting and stopping downstream processing independently, and isolates crashes.
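As a rough illustration of the self-describing counters idea, here is a small C++ sketch. All of it is assumed for the example: `CounterBlock`, `add_counter` and `to_json` are made-up names, the "hierarchy" is just dotted counter names, the names are assumed to need no JSON escaping, and the block is an ordinary object where a real setup would place it in memory mapped into both the worker and the monitor.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical layout: counter values plus a generic description of each one.
// No pointers are stored inside, so it could be mapped at different addresses.
struct CounterBlock {
    static constexpr std::size_t kMax = 16;
    struct Desc { char name[32]; };            // e.g. "rx.packets"
    std::atomic<std::uint32_t> count{0};       // number of registered counters
    Desc desc[kMax] = {};
    std::atomic<std::uint64_t> value[kMax] = {};
};

// Worker side, at startup only: register a counter and get back the word
// that the hot path will bump. Returns nullptr when the block is full.
inline std::atomic<std::uint64_t>* add_counter(CounterBlock& b, const char* name) {
    assert(name != nullptr);
    const std::uint32_t i = b.count.load(std::memory_order_relaxed);
    if (i >= CounterBlock::kMax) return nullptr;
    // desc is zero-initialized and at most 31 bytes are copied,
    // so the stored name is always NUL-terminated.
    std::strncpy(b.desc[i].name, name, sizeof b.desc[i].name - 1);
    b.count.store(i + 1, std::memory_order_release);  // publish the descriptor
    return &b.value[i];
}

// Monitor side: format a snapshot as JSON. The worker never runs this code.
inline std::string to_json(const CounterBlock& b) {
    const std::uint32_t n = b.count.load(std::memory_order_acquire);
    std::string out = "{";
    for (std::uint32_t i = 0; i < n; ++i) {
        if (i != 0) out += ',';
        out += '"';
        out += b.desc[i].name;
        out += "\":";
        out += std::to_string(b.value[i].load(std::memory_order_relaxed));
    }
    out += '}';
    return out;
}
```

The worker keeps the pointer returned by `add_counter` in its working configuration and does nothing more than `ptr->fetch_add(1, std::memory_order_relaxed)` on the hot path. Formatting, logging and dashboards all stay on the monitor's side, which only ever reads the block.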
Ring buffers being read are always mapped read-only, so a crashed reader cannot corrupt any shared state. |
I have had the opportunity to work on/with some of these techniques on fast network protocol/security appliances, so I have some familiarity with them. However, some of your hints (breadcrumbs?) are not known to me, so I have something to research and study. Thank you.
PS: Can you add some more details on the above techniques? For example, system/library/API calls to look into, or books/papers/articles to read?