Hacker News
by CalChris 3298 days ago
I think a simpler argument is that for L1 you want fast, not big. Same thing with registers (a form of cache at a lower level). Why did MIPS only have 32 registers?

Design Principle 2: Smaller is faster. [1]

BTW, if you look at Agner Fog's latency tables [2], mov mem,r (load) went from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been concentrating on faster, which is nice.

And by way of comparison, AMD increased their μop cache size in Ryzen but then only slightly. Way size went from 6 μops to 8. This matches their increase in EUs.

[1] Patterson and Hennessy. Computer Organization and Design, 5th edition, p. 67.

[2] http://www.agner.org/optimize/instruction_tables.pdf


At a high level it's true that smaller is faster, but it's also true that those L1s could have grown by adding sets (not ways) and achieved the same latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or the uarch for some random opcode does not get into the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, far fewer than 32 registers (or historically at least).

My response is like this: yes, the L1 has to be small to be fast, but it has been stuck at 32KB forever now. It could have grown! So it's not as simple as small is fast.

L1 size is probably constrained by a trade-off and competition for area between different CPU parts. If it can be increased with an overall positive effect on performance (while still being economically competitive to build), then I have no doubt Intel will do it... It is probably not crucial to have a big L1 on modern x86 arch because of very deep OOO queues, HT, speculative exec, prefetching, and all the other improvements on IPC and overall package perf that need to keep some efficiency even when L1 can't keep up anyway.

I also vaguely remember the Mill CPU guy talking about cache size constraints just because of the speed of light, but given node size has continued to decrease during the last decade while frequency has nearly stopped increasing, this might be less of an issue than basic area optimizations. Or this might be an interesting consideration on Mill only because it is a radically different architecture, and needs different area ratios.

Only wild guesses though, I haven't even tried to confirm any of that with any kind of research or back-of-the-envelope calculations.

x86_64 has 16 integer registers but Haswell has a 192 entry ROB. Skylake has 224. So Intel does increase these numbers. It's just that there has to be a good reason. In the 90s maybe something like clock speed could win a marketing spec battle. Not today.

I think at 6 transistors per bit we really aren't talking about a lot of die area. Still I'm stone cold certain the Intel architects would increase L1 cache size if that was beneficial, if it modeled out. (However they may want to keep performance similar+predictable unless there's a solid win.)
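To put a rough number on the 6-transistors-per-bit point, here is a back-of-the-envelope sketch. It counts the data array only and assumes a plain 6T SRAM cell; tags, parity/ECC, decoders and sense amps are left out, and the function name is just for illustration.

```python
def sram_data_transistors(size_bytes, transistors_per_bit=6):
    """Transistors in the data array only (no tags, ECC, decoders, sense amps)."""
    bits = size_bytes * 8
    return bits * transistors_per_bit

# Compare the current 32 KiB L1 with hypothetical larger sizes.
for kib in (32, 64, 128):
    count = sram_data_transistors(kib * 1024)
    print(f"{kib} KiB data array: {count:,} transistors")
```

A 32 KiB data array comes out at roughly 1.6 million transistors under these assumptions, which is small next to the transistor budget of a modern core.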

Agner is showing they've reduced L1 latency. So this "smaller is faster" seems to have gotten them something.

So you really have to work backwards and ask why they didn't/don't. There may be more than one reason; but they don't and haven't in quite some time.

I'm an old-school assembly/compiler hack. I read Agner and the Intel Optimization Manual a lot. VTune, IACA and the PMCs. Someone has to do it.

I think maybe we were talking past each other. Yes there is more than one reason.

It's far easier to add capacity by adding sets, as opposed to ways. But they can't add sets in the L1 because of the aliasing problem. When they do increase L1 capacity, if nothing else has changed, then it will be by adding ways.
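A minimal sketch of the sets-vs-ways constraint, assuming the usual reading of "the aliasing problem": the L1 is virtually indexed and physically tagged, so the set index plus line offset has to fit inside the page offset. The 64-byte line, 4 KiB page and 8-way 32 KiB starting point are assumptions about current Intel parts, not something stated in this thread.

```python
def vipt_geometry(size_bytes, ways, line_bytes=64, page_bytes=4096):
    """Return (sets, index+offset bits, page-offset bits, alias_free).

    alias_free is True when the bits used to pick a set and a byte within
    the line all come from the page offset, i.e. they are identical in the
    virtual and physical address.
    """
    sets = size_bytes // (ways * line_bytes)
    # Sizes are powers of two, so bit_length() - 1 is an exact log2.
    index_plus_offset_bits = (sets * line_bytes).bit_length() - 1
    page_offset_bits = page_bytes.bit_length() - 1
    alias_free = index_plus_offset_bits <= page_offset_bits
    return sets, index_plus_offset_bits, page_offset_bits, alias_free

cases = [
    ("32 KiB, 8-way (today)", 32 * 1024, 8),
    ("64 KiB, 8-way (more sets)", 64 * 1024, 8),
    ("64 KiB, 16-way (more ways)", 64 * 1024, 16),
]
for label, size, ways in cases:
    sets, used, avail, ok = vipt_geometry(size, ways)
    print(f"{label}: {sets} sets, {used} of {avail} bits, alias-free={ok}")
```

Under these assumptions, doubling the sets needs a 13th index bit that is not part of the 12-bit page offset, while doubling the ways keeps the set count at 64 and stays alias-free.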

Increasing the register count spends opcodes. That leads to fewer available instructions, or at a minimum constrains opcode optimization.

As we saw with AMD64, x86 is a variable length ISA, with instructions up to 15 bytes long, allowing for quite a flexible (and complex) encoding. With a fixed width RISC, yeah registers are going to eat into opcode space. And in both cases, register renaming will allow more renamed (180) registers than architectural registers.

BTW, renamed != ROB. I got that wrong above.

In variable length architectures it will constrain opcode optimization, making your binaries larger (requiring more cache). It's not as big a problem as in fixed length instruction machines, but adding named registers is never free.
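For the fixed-length case, the cost is easy to sketch. This assumes a 32-bit instruction word with three register operands (a MIPS-like R-type layout); the function names are made up for illustration.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of num_registers architectural registers."""
    return max(1, (num_registers - 1).bit_length())

def bits_left_for_opcode(num_registers, instr_bits=32, reg_operands=3):
    """Bits remaining for opcode/function/immediate after the register fields."""
    return instr_bits - reg_operands * register_field_bits(num_registers)

for regs in (16, 32, 64):
    per_field = register_field_bits(regs)
    left = bits_left_for_opcode(regs)
    print(f"{regs} registers: {per_field} bits per operand, {left} bits left")
```

Going from 32 to 64 named registers costs one more bit per operand, so a three-operand instruction loses 3 bits (17 down to 14) that would otherwise go to opcodes, function codes or immediates.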