by corysama 1581 days ago
This comes up over and over. It's great. But the useful 75% of the content comes after a first 25% that dives way too deep into the details of the electrical engineering.

"Every programmer" should know the orders of magnitude of cache hierarchy latencies, how the CPU pulls a whole cache line from RAM to service that single byte you requested, roughly how the automatic prefetcher thinks, that MESI and NUMA access are a growing issue, that the TLB is a thing, and generally how the memory controller is the interface between the CPU and pretty much everything else, like the NIC, HD and GPU.

"Every programmer" does not need to know about DRAM discharge timing, row selection and refresh cycles.

Understanding quad-pumped bus frequencies and CAS latencies is great when you are building systems. But, it's not something you think about when designing data structures and algos.
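
On the cache line point above, a rough sketch of the arithmetic may help. It assumes 64-byte lines (common on current x86-64 parts, but an assumption here), and the helper names are made up for illustration:

```python
# Assumption: 64-byte cache lines. Check your own hardware before relying on this.
CACHE_LINE_BYTES = 64


def cache_lines_touched(address: int, size: int,
                        line_bytes: int = CACHE_LINE_BYTES) -> range:
    """Return the indices of every cache line overlapped by [address, address + size)."""
    if size <= 0:
        return range(0)
    first = address // line_bytes
    last = (address + size - 1) // line_bytes
    return range(first, last + 1)


def bytes_transferred(address: int, size: int,
                      line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Bytes moved from RAM if none of the touched lines are cached yet."""
    return len(cache_lines_touched(address, size, line_bytes)) * line_bytes
```

So asking for a single byte at address 1000 moves a full 64 bytes, and an 8-byte value that starts at address 60 straddles two lines and moves 128.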

6 comments

Long ago when I worked on real-time digital signal processing I did have to worry about DRAM discharge timing because the board I was using had a non-maskable interrupt to do the RAM refresh, and this limited the rate at which I could access the A/D converter. I think it was an LSI-11 board if I remember right. Fun times.

But yes, unless you're doing bare-metal embedded systems you haven't needed to care in a long time ... at least until someone came up with Rowhammer.

The ones who have the skills to actually use this information will end up reading the whole thing anyway; there's an element of self-selection here.

You don't need to know the DRAM timing stuff, technically, but it doesn't hurt to learn something new.

The actual problem with this doc is that it was written a very long time ago, so although most things are still very true, a few things (like everything on the quality of prefetching techniques) should be taken with a large grain of salt.

It's also wrong about some basic details, AFAICS. A "memory controller" typically has more than one memory channel, and the channels are not always ganged, so independent concurrent accesses can be supported even in a single controller. DRAM can also have multiple banks which can be accessed independently. So there is certainly not only one "bank" per north bridge or one bank per ODMC.
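
To make the channel/bank point concrete, here is a toy address decoder. The bit layout (64-byte lines, 2 channels, 8 banks per channel, plain bit slicing) is purely an assumption for illustration; real controllers use vendor-specific and often hashed mappings:

```python
from typing import NamedTuple

# All three widths are illustrative assumptions, not any real controller's layout.
LINE_BITS = 6     # 64-byte cache line
CHANNEL_BITS = 1  # 2 independent channels
BANK_BITS = 3     # 8 banks per channel


class DramLocation(NamedTuple):
    channel: int
    bank: int
    row_and_column: int


def decode(phys_addr: int) -> DramLocation:
    """Slice a physical address into channel, bank, and the remaining row/column bits."""
    line = phys_addr >> LINE_BITS
    channel = line & ((1 << CHANNEL_BITS) - 1)
    line >>= CHANNEL_BITS
    bank = line & ((1 << BANK_BITS) - 1)
    return DramLocation(channel, bank, line >> BANK_BITS)


def can_overlap(addr_a: int, addr_b: int) -> bool:
    """True if the two accesses land in different (channel, bank) pairs."""
    a, b = decode(addr_a), decode(addr_b)
    return (a.channel, a.bank) != (b.channel, b.bank)
```

Under this made-up layout, consecutive cache lines alternate channels, so addresses 0 and 64 can be serviced concurrently, while 0 and 1024 hit the same bank of the same channel and have to queue.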

On modern systems the memory controller interfaces between the memory network and the DRAM. I wouldn't say the CPU/NIC or CPU/storage boundary touches the memory controller (except if it's writing to memory).

It does. Everything outside of MMIO goes through memory, and MMIO nowadays is just the control plane for device configuration and the main processing state machines (e.g., starting and stopping processing of command and completion packets in ring buffers in memory). So the commands that are the primary control plane for the data operations are all in memory too! Then the data operations themselves are all memory as well, of course.

It is possible for the PCI host bridge DMAs to load and store into caches, but in practice it can be difficult if not impossible to line everything up so the data is in cache when it is required, because of the data throughput and pipelining (many parallel pipelines) latency variations even on local NAND devices, etc.

Maybe you get your command/completion rings from cache (which would be nice, since they have to come in order and the CPU has to operate on them), but it's very hard to get all your data served by the caches.

The old favorite Netflix serving talks show this:

https://people.freebsd.org/~gallatin/talks/euro2021.pdf

My limited understanding is that the CPUs have two most-common interfaces to the rest of the system:

1) Reading and writing to memory addresses in such a way as to be interpreted by the memory controller to forward those actions to/from the PCI bus and other systems. That forwarding being controlled by reading and writing to addresses in a way that the MC interprets as commands to configure itself.

2) Ports -- which still exist for legacy reasons, but are long out of fashion.

What's #3?

The CPU core/cache uses physical addresses to route loads and stores, but they don't have to go to the memory controller on the chip. They could go to the SMP unit if the memory controller for the data is on another chip. Or they could go to a PCI host bridge on the chip, either directly to its register space, or to the register space or memory that belongs to a device behind it. These addresses are called "MMIO" (memory-mapped I/O) and are the way configuration is done. They are also slow, low-bandwidth, and synchronous.
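
A minimal sketch of that routing decision, as a lookup over a physical address map. Every range and target name below is invented for illustration; a real map is set up by firmware and differs per platform:

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int      # exclusive
    target: str


# Hypothetical address map; none of these ranges come from a real system.
ADDRESS_MAP = [
    Region(0x0000_0000, 0x8000_0000, "local memory controller"),
    Region(0x8000_0000, 0xC000_0000, "SMP link to remote memory controller"),
    Region(0xC000_0000, 0xC010_0000, "PCI host bridge registers"),
    Region(0xC010_0000, 0x1_0000_0000, "MMIO of a device behind the host bridge"),
]


def route(phys_addr: int) -> str:
    """Return which unit a load or store to phys_addr would be sent to."""
    for region in ADDRESS_MAP:
        if region.start <= phys_addr < region.end:
            return region.target
    raise ValueError(f"unmapped physical address {phys_addr:#x}")
```

The core issues the same load or store either way; only the address decides whether it ends up at local DRAM, at another socket, or at a device register.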

x86 has "ports". I don't know about modern implementations but I would guess they are done with MMIO out the back end of the core (e.g., the CPU turns inb/outb instructions into accesses to special memory ranges).

Either way MMIO is the way to configure devices.

DMA is how you move data between a device and the CPU, or between devices. But nowadays, with high-performance devices, you don't set up a request, send an MMIO command to process it, then get an interrupt and do an MMIO to read the completion status. The commands and completions themselves are DMA'ed. So you have ring buffers (multiple for a multi-queue or virtualized device) for commands and completions, which get DMA'ed. You do the MMIOs and interrupts only to manage the starting and stopping (fill, empty, etc.) conditions of the queues.
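
Here is a toy model of that queue arrangement, loosely in the style of NVMe-like submission/completion queue pairs. It is a simulation in plain Python, not driver code: the class, its fields, and the "doorbell" counter are all hypothetical names, and the device side is just a method call standing in for DMA:

```python
class QueuePair:
    """Toy command/completion rings living in 'host memory' (plain lists)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.submission = [None] * depth  # commands, written by the driver
        self.completion = [None] * depth  # results, written by the device
        self.sq_tail = 0      # driver: next submission slot to fill
        self.sq_head = 0      # device: next submission slot to consume
        self.cq_tail = 0      # device: next completion slot to fill
        self.cq_head = 0      # driver: next completion slot to consume
        self.doorbell = 0     # stands in for a device register
        self.mmio_writes = 0  # how many times the driver touched that register

    def submit(self, commands: list) -> None:
        """Driver side: queue a batch in memory, then ring the doorbell once."""
        in_flight = self.sq_tail - self.cq_head
        if len(commands) > self.depth - in_flight:
            raise BufferError("rings are full; reap completions first")
        for cmd in commands:
            self.submission[self.sq_tail % self.depth] = cmd
            self.sq_tail += 1
        self.doorbell = self.sq_tail  # the only 'MMIO' for the whole batch
        self.mmio_writes += 1

    def device_process(self) -> None:
        """Device side: fetch commands from memory, post completions to memory."""
        while self.sq_head < self.doorbell:
            cmd = self.submission[self.sq_head % self.depth]
            self.sq_head += 1
            self.completion[self.cq_tail % self.depth] = (cmd, "ok")
            self.cq_tail += 1

    def reap(self) -> list:
        """Driver side: collect completions by reading memory, no MMIO needed."""
        done = []
        while self.cq_head < self.cq_tail:
            done.append(self.completion[self.cq_head % self.depth])
            self.cq_head += 1
        return done
```

Submitting three commands with `submit(["read A", "read B", "write C"])` costs one register write in this model; everything else is reads and writes to the rings in memory, which is the point being made above.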

DMAs can go direct to caches in some architectures, but as I said L3 caches are only so large, and data volumes so large that it can be hard to arrange. You would hope your queues are mostly cached at least, but in reality I don't know if that actually happens well.

Quite Right!

Any resources where I can read up on these?

Well thanks for this summary :-).