I thought having multiple cache levels was about a trade-off between performance and cost. The closer a cache is to the CPU (that is, the faster the cache level), the more expensive it is.
Yes. I don't know where he gets the idea that a large L1 cache is to a CPU the same as a 150mx150m desk to a human. Address decoding is done in parallel, not sequentially. And desks are as large as people are comfortable to produce and use.
Likewise, if SRAM were as cheap to produce as DRAM, main memory would be as fast as the CPU (since it would use the same technology as the CPU) and we would not need the cache at all. Imagine gigabytes of L1 cache!
Well, address decoding can be started in parallel if your page size lets you use a virtually indexed, physically tagged cache, which applies to only some processors. But that's a separate issue from the relationship between cache size and cache speed. That's governed by three things.
First, the larger your cache the more layers of muxing you need to select the data you need, meaning more FO4s of transistor delay.
Second, the larger your cache the physically bigger it is. That means more physical distance between the memory location and where it is used. That means more speed of light delay.
And third there's the issue of resolving contention for shared versus unshared caches.
So even though you're using the same SRAM in both your L1 and L3, access to the former takes 4 clock cycles but access to the latter takes 80.
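To put rough numbers on the first two points, here is a minimal sketch. It assumes a plain binary tree of 2:1 muxes and square arrays with the same cell size at both levels. Real designs are banked and organized differently, and the 32 KiB and 32 MiB sizes are just illustrative picks, not figures for any particular chip.

```python
import math

def extra_mux_levels(small_bytes: int, large_bytes: int) -> int:
    """Extra 2:1 mux levels needed to select data from the larger array.

    Assumption: a simple binary mux tree, so each doubling of
    capacity adds one more level of muxing.
    """
    return math.ceil(math.log2(large_bytes / small_bytes))

def relative_side_length(small_bytes: int, large_bytes: int) -> float:
    """How many times longer one side of the larger array is.

    Assumption: both arrays are square and use the same cell size,
    so area scales with capacity and side length with its square root.
    """
    return math.sqrt(large_bytes / small_bytes)

l1_bytes = 32 * 1024           # hypothetical 32 KiB L1
l3_bytes = 32 * 1024 * 1024    # hypothetical 32 MiB L3

print(extra_mux_levels(l1_bytes, l3_bytes))      # 10
print(relative_side_length(l1_bytes, l3_bytes))  # 32.0
```

Under those assumptions the larger array needs ten more levels of selection and has wires about 32 times longer, before contention is even considered.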
There's also the fact that as you get down the cache hierarchy the cache becomes more complicated. An L1 does lookups for a single processor, and responds to snoops. An L3 probably has several processors hanging off it and may deal with running the cache coherency protocol (e.g. implements a directory of what lines are where and sends clean or invalidation snoops when someone wants to upgrade a line from shared to unique). As a result you've got layers of buffering, arbitration and hazarding to get through before you can even touch the memory array.
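As a rough illustration of the directory idea, here is a toy sketch, not how any real L3 is built. The `Directory` class and its method names are made up for this example. It only tracks sharers and works out who would get an invalidation snoop on an upgrade, and it ignores clean snoops, buffering, arbitration and hazards entirely.

```python
class Directory:
    """Toy directory: tracks which cores hold a copy of each cache line."""

    def __init__(self):
        # line address -> set of core ids currently holding the line
        self.sharers = {}

    def read(self, core, addr):
        """Core takes a shared copy of the line.

        Simplification: no snoops are sent here, so a line held
        unique by another core is not handled by this toy model.
        """
        self.sharers.setdefault(addr, set()).add(core)
        return []

    def upgrade(self, core, addr):
        """Core wants the line unique: every other sharer must be invalidated.

        Returns the sorted list of cores that would receive an
        invalidation snoop.
        """
        others = self.sharers.get(addr, set()) - {core}
        self.sharers[addr] = {core}
        return sorted(others)

d = Directory()
d.read(0, 0x40)
d.read(1, 0x40)
d.read(2, 0x40)
print(d.upgrade(1, 0x40))  # [0, 2]
```

Even this stripped-down version has to consult and update shared state on every request, which is work an L1 serving a single core never does.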
> And desks are as large as people are comfortable to produce and use.
Think about what this implies though -- a desk that is too large becomes difficult for a person to use (for one, the person would have to start walking to access certain parts of it).
Likewise, L1 cache sizes are bounded, because the larger the cache becomes, the more difficult it is to address a particular location, and the cache also becomes physically larger such that speed-of-light propagation delays will slow the entire cache down.
No, a cell of L1 cache is exactly as expensive as a cell of L3 cache (ignoring weird stuff like eDRAM).
Now, SRAM is these days made with 6 or 8 transistors per cell while the DRAM you use in your main memory only takes 1 transistor (plus a capacitor) per cell. Also your DRAM is built with a different sort of silicon process, so it is cheaper on a transistor-for-transistor basis. But generally the dollar cost is the same for memory in any given location.
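For a sense of scale, a quick back-of-the-envelope sketch of the cell counts. It counts only the storage cells themselves (no tags, decoders or sense amplifiers), and the 32 KiB size is just an illustrative pick.

```python
def cell_transistors(capacity_bytes: int, transistors_per_bit: int) -> int:
    """Transistors spent on storage cells alone for a given capacity."""
    return capacity_bytes * 8 * transistors_per_bit

size = 32 * 1024  # hypothetical 32 KiB of storage

print(cell_transistors(size, 6))  # 6T SRAM: 1572864
print(cell_transistors(size, 8))  # 8T SRAM: 2097152
print(cell_transistors(size, 1))  # 1T DRAM: 262144
```

The count depends only on capacity and cell type, not on which cache level the cells belong to.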
Given a fixed area which fits a fixed number of transistors at the same cost, you allocate some portion of those transistors to compute and memory cells.
If you want to maintain your number of memory cells without decreasing the number of compute transistors, you need to grow your area which increases costs. That can be a very expensive thing here.
Additionally, engineering time for layout and architecture differs across those placements and cache requirements, so the cost is not uniform, but amortized it is not as significant as things like chip area.
Changing a chip from having 8kB of L1 Dcache to 16kB might be far more expensive in design terms than making a similar change in L3 cache. But starting from a blank slate, would either be more expensive to design in the first place? When I look at the layout of a late-model x86, the regular structures of the caches stand out in the die photos among the irregular hand-tuned logic. Yes, there are follow-on effects on the layout from changes in cache size, but I don't see any reason a priori to say whether increasing the L1 size will tend to make designing the rest of the core logic harder or easier.
So I still don't see any reason to back off from saying that a cell of L1 costs as much as a cell of L3, modulo concerns about keeping the cache size a power of 2.