
No, I'm asking about off-die SRAM as a replacement for off-die DRAM, not on-die DRAM as an alternative to {off-die DRAM, cache, cores, etc}. There are a bunch of tradeoffs to be made on-die, and I get the reasoning behind them even if I don't know specific numbers. x86 has to enforce permissions, handle sharing, cross-reference a TLB, etc., and you can win significantly by memoizing results subject to statistically unlikely invalidation. There would be separate L1/L2/L3 even if all SRAM cells had identical latency and density. Which they might, I don't know. L4 (eDRAM, what you were talking about) gives you a huge density advantage, but it's still not competitive with SRAM for speed, even though it's on the same process:

http://www.sisoftware.co.uk/?d=qa&f=mem_hsw

    L1:     4 clocks  <-- SRAM
    L2:    12 clocks  <-- SRAM
    L3:    36 clocks  <-- SRAM
    L4:   136 clocks (55ns) <-- eDRAM
    DRAM: 193 clocks (80ns) <-- off-die DRAM
    Clock: 2.5GHz (dynamic overclocking was disabled)
    5cm travel: 1 clock

With SRAM you just have to open the right gate, whereas with DRAM you have to precharge the bitlines, open the word line, wait for the tiny signal to amplify up to logic level, and only then do you get to read it out. Worse, you need tons of logic to reorder memory accesses so that you can take advantage of multiple accesses that hit the same word line or that can proceed simultaneously in different banks. And you need to refresh each word line periodically, which requires even more logic. There is a reason why the memory controller (not the cache, the controller) is a huge chunk of the die, roughly the size of 2 cores!
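To make the word-line point concrete, here is a toy Python model of a single DRAM bank with an open-row policy. It is only a sketch: the three timing constants are made-up round numbers (not from any datasheet), and the names `access_cost` and `total_latency` are mine. It ignores refresh, multiple banks, and pipelining, and only shows why a controller wants to group accesses to the same word line.

```python
# Illustrative timings in nanoseconds. ASSUMED values, not from a datasheet.
T_RP = 14   # precharge the bitlines (close the currently open row)
T_RCD = 14  # open the word line and let the sense amps settle
T_CAS = 14  # read the column out of the open row

def access_cost(open_row, row):
    """Cost of one access given which row (if any) is currently open."""
    if open_row == row:
        return T_CAS                  # row hit: data is already in the sense amps
    if open_row is None:
        return T_RCD + T_CAS          # bank idle: open the row, then read
    return T_RP + T_RCD + T_CAS       # row conflict: close, open, then read

def total_latency(rows):
    """Serve accesses strictly in order on one bank; return total time in ns."""
    open_row = None
    total = 0
    for row in rows:
        total += access_cost(open_row, row)
        open_row = row
    return total

interleaved = [0, 1, 0, 1]
reordered = sorted(interleaved)       # what a reordering controller might do

print(total_latency(interleaved))     # 154
print(total_latency(reordered))       # 98
```

With these assumed numbers, the same four accesses take 154 ns in the original order and 98 ns once the two accesses to each row are grouped. SRAM has no equivalent of this state to track, which is the simplification I'm pointing at.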

If we assume that L3 and L4 have similar management overhead, then the DRAM access itself takes ~100 clock cycles in the comparison above (136 for L4 minus 36 for L3). That dominates the other costs even if we disregard the savings from simpler logic in off-die SRAM; that logic, combined with travel time, accounts for the remaining ~60 cycles (193 for off-die DRAM minus 136 for L4).
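The arithmetic behind those two figures, as a small Python sketch. The clock counts and the 2.5 GHz clock come from the table above; the variable names and the split into "array cost" and "off-die cost" are just my reading of the numbers, not anything the benchmark reports.

```python
CLOCK_GHZ = 2.5  # from the table: dynamic overclocking disabled

# Latencies in clock cycles, copied from the table above.
latency_clocks = {"L1": 4, "L2": 12, "L3": 36, "L4": 136, "DRAM": 193}

def clocks_to_ns(clocks, ghz=CLOCK_GHZ):
    """Convert clock cycles to nanoseconds at the given clock frequency."""
    return clocks / ghz

# eDRAM vs. SRAM on the same die: attributed here to the DRAM access itself.
dram_array_cost = latency_clocks["L4"] - latency_clocks["L3"]

# Off-die DRAM vs. on-die eDRAM: controller logic plus travel time.
off_die_cost = latency_clocks["DRAM"] - latency_clocks["L4"]

print(dram_array_cost, off_die_cost)
for level, clocks in latency_clocks.items():
    print(f"{level}: {clocks} clocks = {clocks_to_ns(clocks):.1f} ns")
```

This gives 100 and 57 cycles respectively, so the "~60" is rounded up. At 2.5 GHz, 136 clocks is about 54 ns and 193 clocks is about 77 ns, so the nanosecond figures in the table look rounded too.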

I still don't understand why off-die SRAM isn't sensible.