| HN Mirror



	by adgjlsfhk1 104 days ago
	This is the most cursed part of modern CPU design, but the TL;DR is that programs use virtual addresses while CPUs use physical addresses, which means that CPU caches need to account for the translation from virtual to physical address. The problem is that for the L1 cache, the latency requirement of 3-4 cycles is too strict to first do a TLB lookup and then an L1 cache lookup, so the L1 can only be indexed by the address bits that are identical between the physical and virtual addresses (the page offset). With a 4k page size, you only have 6 bits between the size of your cache line (64 bytes) and the size of your page, which means that with an 8-way associative L1D, you only get 64 sets * 8 ways * 64 bytes/line = 32 KB of L1 cache. If you want to increase that while keeping the 4k page size, you need to up the associativity, but that has massive power draw and area costs, which is why L1D on x86 has barely increased since the Core 2 Duo in 2006.
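	The arithmetic above can be checked with a short sketch. The function names are illustrative assumptions, not anything from a real tool; the only rule encoded is the one described in the comment, namely that the set index must fit inside the page offset.

```python
def index_bits(page_size: int, line_size: int) -> int:
    """Address bits between the cache-line offset and the page offset."""
    # page_size and line_size are assumed to be powers of two.
    return (page_size // line_size).bit_length() - 1


def max_l1_bytes(page_size: int, line_size: int, ways: int) -> int:
    """Largest L1 whose set index uses only untranslated (page-offset) bits."""
    sets = 2 ** index_bits(page_size, line_size)
    return sets * ways * line_size  # equivalent to page_size * ways


if __name__ == "__main__":
    print(index_bits(4096, 64))         # 6 index bits with 4k pages
    print(max_l1_bytes(4096, 64, 8))    # 32768 bytes = 32 KB
    print(max_l1_bytes(4096, 64, 12))   # 49152 bytes = 48 KB
    print(max_l1_bytes(65536, 64, 8))   # 524288 bytes = 512 KB
```

	Under this constraint the ceiling is simply page size times associativity, so the only knobs are more ways or bigger pages.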

2 comments

joha4270 104 days ago

Can you not take some of those virtual bits and get more buckets that way? I am sure it will make things more complicated, if nothing else because two virtual addresses could be mapped to the same physical page, but it doesn't sound like an impossible barrier. Maybe something terrible where a cache line keeps bouncing between different buckets in the rare case that does happen, but as long as you can keep the common case fast...

OTOH, L1 sizes haven't increased since my first processor, so those CPU designers probably know more than I do.


dmitrygr 104 days ago

that will break if any page is mapped at two VAs, you'll end up with conflicting cache lines for the same page...
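A minimal sketch of the aliasing problem, assuming 64-byte lines, 4k pages, and two made-up virtual addresses that the OS has mapped to the same physical page. `set_index` is a hypothetical helper that just extracts the index bits from a virtual address.

```python
LINE_BITS = 6    # 64-byte cache lines
PAGE_BITS = 12   # 4k pages


def set_index(vaddr: int, index_bits: int) -> int:
    """Set selected when the cache is indexed by virtual address bits."""
    return (vaddr >> LINE_BITS) & ((1 << index_bits) - 1)


# Two hypothetical virtual pages assumed to alias the same physical page.
# They differ in bit 12, the lowest bit that is NOT part of the page offset.
VA_A = 0x10000
VA_B = 0x21000
OFFSET = 0x40    # same byte within the shared physical page

if __name__ == "__main__":
    # 6 index bits: the index lies entirely inside the page offset.
    print(set_index(VA_A + OFFSET, 6), set_index(VA_B + OFFSET, 6))  # 1 1
    # 7 index bits: bit 12 leaks into the index, so the same physical
    # line can land in two different sets.
    print(set_index(VA_A + OFFSET, 7), set_index(VA_B + OFFSET, 7))  # 1 65
```

With 7 index bits the same physical data can sit in sets 1 and 65 at once, and a write through one mapping is not seen through the other unless extra hardware tracks the duplicate.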


joha4270 104 days ago

The L2 already keeps track of what lines are somewhere in L1's for managing coherency.

Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate from the L2's point of view. Duplicate the data and if somebody writes back invalidate all the other copies. The hardware already exists for doing this on any multicore system. Sure, you will end up duplicating data sometimes and it will actually be slower if you're actually writing to aliased locations. But is this happening often enough to be a problem compared to generally having a bigger cache?

It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which is at least what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.


dmitrygr 104 days ago

this "divide" thing will add latency which you really do not want to add to L1 hits


rslashuser 104 days ago

Nice HN explanation! One hopes we will not be living with 4kb pages forever, and perhaps L1 performance will be one more reason.


tliltocatl 104 days ago

I'd really hope we do live with 4kb pages forever. Variable page size would make many remapping optimizations (e.g. contiguous ring buffers) much harder to do, so we would need more abstraction layers, and more abstraction layers will eat away all the performance gains while also making everything more fragile and harder to understand. Hardware people really love those "performance hacks" that make life more painful for the upper layers in exchange for a few 0.1%s of speed. You could also probably gain some speed by dropping byte access and saying the minimal addressable unit is now 32 bits. Please don't. If you need a larger L1 cache, just increase associativity.
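One concrete way page size leaks into the ring buffer trick, as a rough sketch: a buffer that is mapped twice back to back has to be a whole number of pages, because mappings are page-granular. `mirrored_ring_size` is a hypothetical helper, not an API from any library, and it only does the rounding, not the mapping itself.

```python
def mirrored_ring_size(requested: int, page_size: int) -> int:
    """Round a requested ring buffer size up to a whole number of pages."""
    # Assumption: the buffer will be mapped twice at adjacent virtual
    # addresses, which only works at page granularity.
    pages = -(-requested // page_size)  # ceiling division
    return pages * page_size


if __name__ == "__main__":
    print(mirrored_ring_size(1000, 4096))    # 4096
    print(mirrored_ring_size(5000, 4096))    # 8192
    print(mirrored_ring_size(1000, 65536))   # 65536
```

Code that hardcodes 4096 here would need to query the page size at runtime instead, which is the kind of extra layer the comment is worried about.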


adgjlsfhk1 104 days ago

The extra L1 cache from a 64k page is on its own a ~5-10% perf improvement (and it decreases power use by reducing the number of times you go out to L2).


spijdar 104 days ago

Funny, most of what you described sums up the Alpha architecture. 8KB pages + huge pages and, initially, only word-addressable memory, no byte access.

(Of course, it only took a few years for this to be rectified with the byte-word extension, which became required by ~all "real software" that supported Alpha)

It's also one of the only architectures Windows NT supported that didn't have 4KB pages, along with Itanium. I've wondered how (or if?) it handled programs that expect 4KB pages, especially in the x86 translation subsystem.
