| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by yaantc 477 days ago

On the L/S unit impact: data movement is expensive, computation is cheap (relatively).

In "Computer Architecture, A Quantitative Approach" there are numbers for the now old TSMC 45nm process: A 32 bits FP multiplication takes 3.7 pJ, and a 32 bits SRAM read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (more expansive).

Then I have some 2015 numbers for Intel 22nm process, old too. A 64 bits FP multiplication takes 6.4 pJ, a 64 bits read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expansive cache.

The cost of a multiplication is quadratic, and it should be more linear for access, so the computation cost in the second example is much heavier (compare the mantissa sizes, that's what is multiplied).

The trend gets even worse with more advanced processes. Data movement is usually what matters the most now, expect for workloads with very high arithmetic intensity where computation will dominate (in practice: large enough matrix multiplications).

5 comments

Remnant44 476 days ago

Appreciate the detail! That explains a lot of what is going on.. It also dovetails with some interesting facts I remember reading about the relative power consumption for the zen cores versus the infinity fabric connecting them - The percentage of package power usage simply from running the fabric interconnect was shocking.

link

Earw0rm 476 days ago

Right, but a SIMD single precision mul is linear (or even sub linear) relative to it's scalar cousin. So a 16x32, 512-bit MUL won't be even 16x the cost of a scalar mul, the decoder has to do only the same amount of work for example.

link

kimixa 476 days ago

The calculations within each unit may be, true, but routing and data transfer is probably the biggest limiting factor on a modern chip. It should be clear that placing 16x units of non-trivial size means that the average will likely be further away from the data source than a single unit, and transmitting data over distances can have greater-than-linear increasing costs (not just resistance/capacitance losses, but to hit timing targets you need faster switching, which means higher voltages etc.)

link

dzaima 476 days ago

Both Intel and AMD to some extent separate the vector ALUs and the register file in 128-bit (or 256-bit?) lanes, across which arithmetic ops won't need to cross at all. Of course loads/stores/shuffles still need to though, making this point somewhat moot.

link

eigenform 476 days ago

AFAIK you have to think about how many different 512b paths are being driven when this happens, like each cycle in the steady-state case is simultaneously (in the case where you can do two vfmadd132ps per cycle):

- Capturing 2x512b from the L1D cache

- Sending 2x512b to the vector register file

- Capturing 4x512b values from the vector register file

- Actually multiplying 4x512b values

- Sending 2x512b results to the vector register file

.. and probably more?? That's already like 14*512 wires [switching constantly at 5Ghz!!], and there are probably even more intermediate stages?

link

jiggawatts 476 days ago

… per core. There are eight per compute tile!

I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?

link

bgnn 475 days ago

Piggy backing on this: memory scaling was slowter than compute scaling, at least since 45nm in the example. For 4nm the difference is larger.

link

formerly_proven 476 days ago

Random logic had also much better area scaling than SRAM since EUV which implies that gap continues to widen at a faster rate.

link