| On the L/S unit impact: data movement is expensive, computation is (relatively) cheap. "Computer Architecture: A Quantitative Approach" gives numbers for the now-old TSMC 45nm process: a 32-bit FP multiplication takes 3.7 pJ, and a 32-bit read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (which is more expensive). I also have some 2015 numbers for the Intel 22nm process, old too: a 64-bit FP multiplication takes 6.4 pJ, a 64-bit read/write from a small 8 kB SRAM takes 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expensive cache. The cost of a multiplication grows roughly quadratically with operand width, while the cost of an access should grow closer to linearly, so the computation in the second example is inherently much heavier (compare the mantissa sizes, 24 vs 53 bits for IEEE 754 single vs double, since that's what gets multiplied). The trend gets even worse with more advanced processes. Data movement is usually what matters most now, except for workloads with very high arithmetic intensity, where computation dominates (in practice: large enough matrix multiplications). |
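
To make the comparison concrete, here is a minimal Python sketch that turns the energy figures quoted above into access-to-multiply ratios, and compares the quadratic scaling of a multiplier with the linear scaling of an access when going from 32 to 64 bits. The dictionary layout and function names are purely illustrative, and the scaling model (multiplier cost proportional to mantissa width squared, access cost proportional to word width) is only the rough assumption stated in the comment, not a measured result.

```python
# Energy figures quoted in the comment above, in picojoules (pJ).
# The keys are illustrative labels, not official names.
FIGURES = {
    "tsmc45_32bit": {"fp_mul": 3.7, "sram_8kB": 5.0},
    "intel22_64bit": {"fp_mul": 6.4, "sram_8kB": 4.2, "sram_256kB": 16.7},
}

# IEEE 754 significand widths, including the implicit leading bit.
MANTISSA_BITS = {32: 24, 64: 53}


def access_to_mul_ratio(process: str, sram: str) -> float:
    """Energy of one SRAM access divided by the energy of one FP multiply."""
    entry = FIGURES[process]
    return entry[sram] / entry["fp_mul"]


def mul_scaling(narrow: int, wide: int) -> float:
    """Assumed quadratic model: multiplier cost ~ (mantissa width)^2."""
    return (MANTISSA_BITS[wide] / MANTISSA_BITS[narrow]) ** 2


def access_scaling(narrow: int, wide: int) -> float:
    """Assumed linear model: access cost ~ word width."""
    return wide / narrow


if __name__ == "__main__":
    for process, entry in FIGURES.items():
        for sram in (k for k in entry if k.startswith("sram")):
            ratio = access_to_mul_ratio(process, sram)
            print(f"{process} {sram}: access/mul = {ratio:.2f}")
    print(f"32 -> 64 bit multiply scaling: {mul_scaling(32, 64):.2f}x")
    print(f"32 -> 64 bit access scaling:   {access_scaling(32, 64):.2f}x")
```

Under these assumptions the 64-bit multiply is expected to cost several times the 32-bit one while the access only doubles, which is why the 22nm ratio for the 8 kB SRAM looks more favorable to data movement than the 45nm one even though the underlying trend goes the other way.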