

by junon 527 days ago

That's very true, though AFAIK they aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the address translation (the TLB lookup) and the cache set lookup can proceed in parallel in the circuitry.

But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.

EDIT: Looks like some of the x86 manus have figured out a VIPT that has fewer downsides and behaves more like a PIPT cache. I'd imagine "how" is more of a trade secret though. I wonder what ARM manus do, actually. Going to have to look it up :D


tliltocatl 527 days ago

The original ARM (as in Acorn RISC Machine) did VIVT. Interestingly, to allow the OS to access physical memory without aliasing, ARM1 only translated part of the (26-bit) address space; the rest of it was always physical.

Nowadays you don't see it, precisely because of the problems with aliasing. Hardware people would love to have these back, because having to do shared-index is what limits L1 cache size today. Hope nobody actually does it, because this is a thing that you can't really abstract away and it interacts badly with applications that aren't aware of it.
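
For a concrete feel of the index-sharing limit: a VIPT cache stays alias-free without extra tricks only when the set index plus the line offset fit inside the page offset, which caps its size at page size times associativity. A minimal Python sketch of that arithmetic, assuming power-of-two sizes; the function names are made up for illustration and don't describe any particular core:

```python
def max_alias_free_vipt_bytes(page_size: int, ways: int) -> int:
    """Largest VIPT cache whose set index fits entirely in the page offset."""
    return page_size * ways


def index_bits_above_page_offset(cache_bytes: int, ways: int,
                                 line_bytes: int, page_size: int) -> int:
    """Number of set-index bits that come from the virtual page number.

    Zero means the index is the same in the virtual and physical address,
    so two virtual aliases of one physical line land in the same set.
    Assumes every argument is a power of two.
    """
    sets = cache_bytes // (ways * line_bytes)
    # Bits covered by (line offset + set index).
    index_and_offset_bits = (sets * line_bytes).bit_length() - 1
    page_offset_bits = page_size.bit_length() - 1
    return max(0, index_and_offset_bits - page_offset_bits)
```

With 4 KiB pages and 8 ways the cap works out to 32 KiB; doubling the cache at the same associativity pushes one index bit above the page offset, which is where aliasing becomes possible.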

Somewhat tangentially, this is also true for other virtual memory design choices, like page size (Apple silicon had problems with software that assumed 4096-byte pages). And I seriously wish CPU designers would not be all too creative with such hard-to-abstract things. Shaving off a few hundred transistors isn't really worth the eternal suffering of everyone who has to provide compatibility for this. Nowadays that's generally recognised (RISC-V was quite conscious about it). Pre-AMD64 systems like Itanium and MIPS were a total wild west about it.
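
On the page-size point, a minimal sketch of asking the OS instead of hard-coding 4096. `mmap.PAGESIZE` is part of the Python standard library; `round_up_to_page` is a hypothetical helper name used only for illustration:

```python
import mmap


def round_up_to_page(nbytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round nbytes up to a whole number of pages.

    The default page size comes from the running system rather than
    from a baked-in 4096.
    """
    if nbytes < 0 or page_size <= 0:
        raise ValueError("nbytes must be >= 0 and page_size > 0")
    # Ceiling division without floating point.
    return -(-nbytes // page_size) * page_size
```

The same 5000-byte request rounds to 8192 bytes with 4 KiB pages but to 16384 bytes with 16 KiB pages, which is exactly the kind of difference that code assuming one page size gets wrong.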

Another example of a hard-to-abstract thing that is still ubiquitous is incoherent TLBs. It might have been the right choice back when SMP and multithreading were uncommon (a TLB flush on a single core is cheap), but it certainly isn't anymore, with IPIs being super expensive. The problem is that it directly affects how we write applications. Memory reclamation is so expensive that it's not worth it, so nobody bothers. Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere. Which means it's hard to quantify the real-life benefit of TLB coherence, which makes it even more unlikely that we ever get it.


fanf2 527 days ago

The original ARMs (ARM1 and ARM2) were cacheless; ARM3 was the first with a cache.

The CPU’s 26 bit address space was split into virtually mapped RAM in the bottom half, and the machine’s physical addresses in the top half. The physical address space had the RAM in the lowest addresses, with other stuff such as ROMs, memory-mapped IO, etc. higher up. The virtual memory hardware was pretty limited: it could not map a page more than once. But you could see both the virtually mapped and physically addressed versions of the same page.

RISC OS used this by placing the video memory in the lowest physical addresses in RAM, and also mapping it into the highest virtual addresses, so there were two copies of video memory next to each other in the middle of the 26 bit address space. The video hardware accessed memory using the same address space as the CPU, so it could do fast full-screen scrolling by adjusting the start address, using exactly the same trick as in the article.
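
The effect of having the same memory visible twice back to back can be modelled in a few lines. This is a toy simulation in plain Python, not real page tables or the actual RISC OS layout; `MirroredWindow` and its methods are hypothetical names:

```python
class MirroredWindow:
    """Models a buffer that appears twice, back to back, in an address space."""

    def __init__(self, backing: bytearray):
        self.backing = backing
        self.size = len(backing)

    def load(self, address: int) -> int:
        # Addresses 0..size-1 hit the first view, size..2*size-1 the second;
        # both views resolve to the same backing byte.
        if not 0 <= address < 2 * self.size:
            raise IndexError("address outside the two views")
        return self.backing[address % self.size]

    def frame(self, start: int) -> bytes:
        """Read one full frame as a linear run beginning at `start`."""
        if not 0 <= start < self.size:
            raise IndexError("start must lie in the first view")
        return bytes(self.load(start + i) for i in range(self.size))
```

Moving `start` scrolls the frame without copying any data: the linear read simply runs off the end of the first view into the second, which holds the same bytes.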


junon 527 days ago

Thanks for the information!

> Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere.

Curious if you could expand on this a bit; memcpy still requires that two buffers are mapped in anyway. Do you mean that avoiding maps is more important than avoiding copies? Or is there something inherent about multiple linear addresses -> same physical address that is somehow slower on modern processors?


tliltocatl 527 days ago

Assume an (untrusted) application A wants to send a stream of somewhat long (several tens of KB/multiple pages each) messages to application B. A and B could establish a shared memory region for this, but that could allow A to trigger a TOCTOU vulnerability in B by modifying the buffer after B has started reading the message. If page capability reclamation were cheap, the OS could unmap the shared buffer from A before notifying B of the incoming message. But nowadays unmapping requires synchronizing with all CPUs that might have A's mapping in their TLBs, so memcpy is cheaper.
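
A toy illustration of that TOCTOU window, as a single-threaded simulation: the `racing_writer` callback stands in for A modifying the shared buffer concurrently, and the one-byte length prefix, `MAX_PAYLOAD`, and both function names are assumptions made up for the example:

```python
MAX_PAYLOAD = 2  # hypothetical limit that receiver B enforces


def read_in_place(shared: bytearray, racing_writer=None) -> bytes:
    """Validate, then re-read the shared buffer: open to TOCTOU."""
    if shared[0] > MAX_PAYLOAD:          # time of check
        raise ValueError("payload too long")
    if racing_writer is not None:
        racing_writer(shared)            # A changes the buffer after the check
    length = shared[0]                   # time of use: length is read again
    return bytes(shared[1:1 + length])


def read_from_copy(shared: bytearray, racing_writer=None) -> bytes:
    """Snapshot first (the "memcpy"), then validate and use the snapshot."""
    private = bytes(shared)              # A can no longer affect what B sees
    if private[0] > MAX_PAYLOAD:
        raise ValueError("payload too long")
    if racing_writer is not None:
        racing_writer(shared)            # too late: B only reads `private`
    return private[1:1 + private[0]]
```

The copy buys the same guarantee that unmapping the buffer from A would: once B starts validating, the bytes can't change underneath it.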
