| HN Mirror


by Symmetry 3604 days ago

Well, address decoding can be started in parallel if your page size lets you use a virtually indexed, physically tagged cache, which applies to only some processors. But that's a separate issue from the relationship between cache size and cache speed, which is governed by three things.
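To make the page-size condition concrete: the usual rule of thumb is that the set index can come entirely from untranslated address bits when one way of the cache (cache size divided by associativity) is no larger than a page. A minimal sketch of that check follows; the function name and the example geometries are made up for illustration and are not taken from any particular processor.

```python
def vipt_index_fits_in_page(cache_bytes: int, ways: int, page_bytes: int) -> bool:
    """Return True if the index and line-offset bits fit in the page offset.

    Each way spans cache_bytes / ways bytes. If that span is no larger than
    a page, the set index uses only address bits that translation leaves
    unchanged, so the set lookup can start in parallel with the TLB lookup.
    """
    if ways < 1 or cache_bytes % ways != 0:
        raise ValueError("cache size must be a positive multiple of the way count")
    return cache_bytes // ways <= page_bytes


# Illustrative geometries (assumptions, not real chip specifications):
print(vipt_index_fits_in_page(32 * 1024, 8, 4 * 1024))  # 4 KiB per way
print(vipt_index_fits_in_page(64 * 1024, 4, 4 * 1024))  # 16 KiB per way
```

With 4 KiB pages, the first geometry passes the check and the second does not, which is one reason L1 caches tend to grow in associativity rather than in way size.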

First, the larger your cache, the more layers of muxing you need to select the data you want, meaning more FO4s (fanout-of-4 gate delays) of transistor delay.
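As a rough illustration of how the muxing depth grows, here is a toy model that counts the levels in a tree of fixed fan-in muxes needed to pick one line out of the whole array. It assumes 64-byte lines and a plain mux tree, which is a simplification: real caches are split into banks and subarrays with their own decoders, so treat the numbers as a trend, not a timing estimate.

```python
def mux_levels(num_inputs: int, fan_in: int = 2) -> int:
    """Levels of fan_in:1 muxes needed to select one of num_inputs."""
    if num_inputs < 1 or fan_in < 2:
        raise ValueError("need at least one input and a fan-in of 2 or more")
    levels = 0
    reachable = 1  # inputs a tree of `levels` levels can choose between
    while reachable < num_inputs:
        reachable *= fan_in
        levels += 1
    return levels


LINE_BYTES = 64  # assumed line size

for name, size in (("32 KiB", 32 * 1024), ("8 MiB", 8 * 1024 * 1024)):
    lines = size // LINE_BYTES
    print(name, lines, "lines,", mux_levels(lines), "levels of 2:1 muxes")
```

The depth grows with the logarithm of the capacity, so each doubling of the cache adds one more 2:1 stage in this model.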

Second, the larger your cache, the physically bigger it is. That means more physical distance between the memory location and where it is used, and therefore more speed-of-light delay.
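A back-of-the-envelope way to see the distance effect, assuming a fixed area per bit and a roughly square array (both assumptions made for illustration): area scales with capacity, so the distance to the farthest cell scales with its square root.

```python
import math


def relative_wire_length(small_bytes: int, large_bytes: int) -> float:
    """Ratio of worst-case wire length between two square arrays.

    Assumes area is proportional to capacity and both arrays are square,
    so edge length (and hence wire length) goes as sqrt(capacity).
    """
    if small_bytes <= 0 or large_bytes <= 0:
        raise ValueError("capacities must be positive")
    return math.sqrt(large_bytes / small_bytes)


# Hypothetical sizes: a 32 KiB L1 versus an 8 MiB L3.
print(relative_wire_length(32 * 1024, 8 * 1024 * 1024))
```

Under these assumptions, a cache 256 times larger has wires about 16 times longer.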

And third, there's the issue of resolving contention for shared versus unshared caches.

So despite the fact that you're using the same SRAM in both your L1 and L3, access to the former takes 4 clock cycles but access to the latter takes 80.

1 comment

gchadwick 3604 days ago

There's also the fact that as you get further down the cache hierarchy, the caches become more complicated. An L1 does lookups for a single processor and responds to snoops. An L3 probably has several processors hanging off it and may have to run the cache coherency protocol (e.g. it implements a directory of which lines are where and sends clean or invalidation snoops when someone wants to upgrade a line from shared to unique). As a result, you've got layers of buffering, arbitration and hazarding to get through before you can even touch the memory array.
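The directory idea can be sketched as a toy model: track which cores hold each line, and when one core asks for unique ownership, work out which other cores need an invalidation snoop. The class and method names here are hypothetical, and this leaves out everything that makes a real implementation hard (line states beyond shared/unique, in-flight requests, evictions, arbitration).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Directory:
    """Toy directory: maps a line address to the set of cores holding it."""

    sharers: Dict[int, Set[int]] = field(default_factory=dict)

    def read(self, core: int, line: int) -> None:
        """Record that `core` now holds `line` in a shared state."""
        self.sharers.setdefault(line, set()).add(core)

    def upgrade_to_unique(self, core: int, line: int) -> List[int]:
        """Give `core` sole ownership of `line`.

        Returns the other cores that held the line and therefore need an
        invalidation snoop before the upgrade can complete.
        """
        holders = self.sharers.get(line, set())
        to_invalidate = sorted(holders - {core})
        self.sharers[line] = {core}
        return to_invalidate


directory = Directory()
for core in (0, 1, 2):
    directory.read(core, line=0x40)

# Core 1 wants to write the line, so the other sharers must be invalidated.
print(directory.upgrade_to_unique(1, line=0x40))
```

Even in this stripped-down form, every upgrade involves a lookup, a fan-out of snoops and a state update before the data array is touched, which is the extra work an L1 mostly avoids.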