by etep 3302 days ago
At a high level it's true that smaller is faster, but it's also true that those L1s could have grown by adding sets (not ways) and achieved the same latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or the uarch for some random opcode does not get into the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, far fewer than 32 registers (or historically at least).

My response is like this: yes, the L1 has to be small to be fast, but it has been stuck at 32KB forever now. It could have grown! So it's not as simple as small is fast.

2 comments

L1 size is probably constrained by a trade-off and competition for area between different CPU parts. If it can be increased with an overall positive effect on performance (while still being economically competitive to build), then I have no doubt Intel will do it... It is probably not crucial to have a big L1 on modern x86 arch because of very deep OOO queues, HT, speculative exec, prefetching, and all the other improvements on IPC and overall package perf that need to keep some efficiency even when L1 can't keep up anyway.

I also vaguely remember the Mill CPU guy talking about cache size constraints just because of the speed of light, but given node size has continued to decrease during the last decade while frequency has nearly stopped increasing, this might be less of an issue than basic area optimizations. Or this might be an interesting consideration on Mill only because it is a radically different architecture, and needs different area ratios.

Only wild guesses though, I haven't even tried to confirm any of that with any kind of research or back-of-the-envelope calculations.

x86_64 has 16 integer registers but Haswell has a 192 entry ROB. Skylake has 224. So Intel does increase these numbers. It's just that there has to be a good reason. In the 90s maybe something like clock speed could win a marketing spec battle. Not today.

I think at 6 transistors per bit we really aren't talking about a lot of die area. Still I'm stone cold certain the Intel architects would increase L1 cache size if that was beneficial, if it modeled out. (However they may want to keep performance similar+predictable unless there's a solid win.)
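To put a rough number on the "6 transistors per bit" point, here is a back-of-the-envelope sketch. It assumes a plain 6T SRAM cell and counts only the data array; tags, parity/ECC, decoders and sense amps are ignored, and the function name is made up for illustration.

```python
def data_array_transistors(capacity_bytes, transistors_per_bit=6):
    """Transistors in the data array only (assumes a 6T SRAM cell)."""
    bits = capacity_bytes * 8
    return bits * transistors_per_bit

# 32 KiB is the L1 size discussed in this thread; 64 KiB is a hypothetical doubling.
for kib in (32, 64):
    print(f"{kib} KiB -> {data_array_transistors(kib * 1024):,} transistors")
```

Either figure is small next to a die with billions of transistors, which is why raw area alone probably does not explain the 32KB limit.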

Agner is showing they've reduced L1 latency. So this "smaller is faster" seems to have gotten them something.

So you really have to work backwards and ask why they didn't/don't. There may be more than one reason; but they don't and haven't in quite some time.

I'm an old school assembly/compiler hack. I read Agner and the Intel Optimization Manual a lot. VTune, IACA and the PMCs. Someone has to do it.

I think maybe we were talking past each other. Yes there is more than one reason.

It's far easier to add capacity by adding sets, as opposed to ways. But they can't add sets in the L1 because of the aliasing problem. When they do increase L1 capacity, if nothing else has changed, then it will be by adding ways.
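A minimal sketch of the aliasing constraint, assuming a virtually indexed, physically tagged L1 with 64-byte lines and 4 KiB pages (those parameters, the 8-way baseline, and the function name are assumptions, not something stated above). The set index avoids aliasing only if it comes entirely from the page-offset bits, i.e. sets * line size <= page size.

```python
def vipt_check(capacity_bytes, ways, line_bytes=64, page_bytes=4096):
    """Return (sets, bytes_per_way, alias_free) for a VIPT cache.

    alias_free is True when one way spans no more than one page, so
    every index bit lies inside the untranslated page offset.
    """
    sets = capacity_bytes // (ways * line_bytes)
    bytes_per_way = sets * line_bytes
    return sets, bytes_per_way, bytes_per_way <= page_bytes

# (capacity in KiB, ways): the current size, then doubling via sets vs. via ways.
for kib, ways in [(32, 8), (64, 8), (64, 16)]:
    sets, per_way, ok = vipt_check(kib * 1024, ways)
    print(f"{kib} KiB, {ways}-way: {sets} sets, {per_way} B per way, alias-free={ok}")
```

Under these assumptions 32 KiB at 8 ways exactly fills a 4 KiB page per way, doubling the sets breaks the condition, and doubling the ways keeps it.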