by maksut 764 days ago
That is interesting. I wonder if L1 is denser because it has to have more bandwidth. But doesn't that point to a space constraint rather than money? Wouldn't a combination of L1 and L2 have a bigger capacity, and so be faster than a pure L1 cache in the same space (for some/most workloads)?

I always thought cache layers were because of locality but that is my imagination :) The article talks about different access patterns of the cache layers which makes sense in my mind.

It also mentions density briefly:

> Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.
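To put rough numbers on the "L1 plus L2 versus one big cache" question, here is a minimal sketch using the textbook average memory access time (AMAT) formula. Every latency and miss rate below is a made-up illustrative value, not a measurement of any real CPU, and the function name `amat` is just a hypothetical helper.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (added_latency_cycles, local_miss_rate), from L1 outward.
            added_latency_cycles is the extra time spent at that level.
    memory_latency: extra cycles to reach DRAM once the last level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for added_latency, miss_rate in levels:
        total += reach * added_latency
        reach *= miss_rate
    return total + reach * memory_latency


# Hypothetical numbers, chosen only for illustration.
# Small fast L1 (4 cycles, 5% miss) backed by an L2 (14 more cycles, 20% miss).
two_level = amat([(4, 0.05), (14, 0.20)], memory_latency=200)
# One big cache in the same area: fewer misses, but every hit is slower.
one_big = amat([(8, 0.02)], memory_latency=200)

print(f"L1+L2: {two_level:.1f} cycles, single big cache: {one_big:.1f} cycles")
```

With these assumed numbers the two-level hierarchy averages about 6.7 cycles against about 12 for the single large cache, because the common case (an L1 hit) stays cheap. Different assumed miss rates can flip the result, which is the "for some/most workloads" caveat.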


> doesn't that point to a space constraint rather than money?

The space constraints are also caused by money. The reason we don't just add more L1 cache is that it would take up a lot of space, necessitating larger dies, which lowers yields and significantly increases the cost per chip.

I would say it's physics, not money.

Space constraints are caused by power and latency limits even with infinite money.

That isn't true at all. The limited speed at which a signal can propagate across a chip, and the added levels of muxing, necessarily mean that there's a lower limit on the latency of a cache, and that limit is roughly proportional to the square root of its capacity.

It actually is true. You're also right that physics would eventually constrain die size, but it isn't the bottleneck that's keeping CPUs at their typical current size. This should be pretty obvious from the existence of (rare and expensive) specialty CPU dies that are much larger than typical ones. They're clearly not physically impossible to produce, just very expensive. The current bottleneck holding back die sizes is in fact cost, since with larger dies each of the inevitable blemishes ruins a larger chunk of your silicon wafer, cratering yields.
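The yield argument can be sketched with the simple Poisson yield model, where the fraction of defect-free dies is exp(-area × defect density). This is a rough sketch under stated assumptions: the defect density, wafer cost, and the helper name `cost_per_good_die` are all hypothetical, and edge loss on the wafer is ignored.

```python
import math


def cost_per_good_die(die_area_mm2, defects_per_cm2=0.1,
                      wafer_diameter_mm=300.0, wafer_cost=10_000.0):
    """Cost of one working die under a Poisson yield model.

    All default values are illustrative assumptions, not real fab data.
    """
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    # Idealized die count: ignores partial dies lost at the wafer edge.
    dies_per_wafer = wafer_area_mm2 / die_area_mm2
    # Poisson model: probability that a die has zero defects.
    die_area_cm2 = die_area_mm2 / 100.0
    yield_fraction = math.exp(-die_area_cm2 * defects_per_cm2)
    return wafer_cost / (dies_per_wafer * yield_fraction)


small = cost_per_good_die(100.0)   # hypothetical 100 mm^2 die
large = cost_per_good_die(800.0)   # hypothetical 800 mm^2 die
print(f"8x the area costs about {large / small:.1f}x per good die")
```

Under these assumptions an 8x larger die costs roughly 16x as much per working chip: you get fewer dies per wafer and a smaller fraction of them survive. Large dies are possible, just expensive.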

> added levels of muxing necessarily mean that there's a limit to how low the latency of a cache can be

L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.

The path of loading data from L1 is one of the tightest, most timing-critical parts of a modern CPU. Every cycle of latency here has a very clear, measurable impact on performance, and modern designs typically have a 4-5 cycle L1 load-to-use latency. Current AMD cores do really well against Intel ones despite clocking lower and being weaker on most types of resources simply because they have a 1 cycle advantage. If you had literally infinite cheap transistors available, it would not be a good idea to spend them on the L1 cache, because this would make the CPU slower.

> L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.

Every time you double the size of a cache, you need to add a single extra mux on the access path, simply to be able to select which half of the cache the result comes from. You also increase the distance that a signal needs to propagate, but I believe for L1 the muxes dominate.
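As a back-of-the-envelope sketch of the two scaling effects described above: one extra mux level per doubling of capacity, and wire length growing with the side of the array. It assumes area is proportional to capacity and the array is laid out as a square; the 32 KiB baseline and the function name `relative_access_cost` are hypothetical.

```python
import math


def relative_access_cost(capacity_kib, base_kib=32):
    """Return (extra mux levels, wire length scale) relative to a baseline.

    Assumes one 2:1 mux level per doubling of capacity, and a square array
    whose area grows linearly with capacity (so its side grows as sqrt).
    """
    ratio = capacity_kib / base_kib
    extra_mux_levels = math.log2(ratio)
    wire_scale = math.sqrt(ratio)
    return extra_mux_levels, wire_scale


for size in (32, 128, 1024):
    muxes, wire = relative_access_cost(size)
    print(f"{size:5d} KiB: +{muxes:.0f} mux levels, {wire:.2f}x wire length")
```

The mux count grows only logarithmically while the wire length grows as a square root, so which term dominates depends on the cache size and the process. That fits the point that muxes matter most at L1 sizes.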

Also it draws a huge amount of power.