Hacker News new | ask | show | jobs
by Earw0rm 481 days ago
Right, but a SIMD single precision mul is linear (or even sub linear) relative to it's scalar cousin. So a 16x32, 512-bit MUL won't be even 16x the cost of a scalar mul, the decoder has to do only the same amount of work for example.
1 comments

The calculations within each unit may be, true, but routing and data transfer is probably the biggest limiting factor on a modern chip. It should be clear that placing 16x units of non-trivial size means that the average will likely be further away from the data source than a single unit, and transmitting data over distances can have greater-than-linear increasing costs (not just resistance/capacitance losses, but to hit timing targets you need faster switching, which means higher voltages etc.)
Both Intel and AMD to some extent separate the vector ALUs and the register file in 128-bit (or 256-bit?) lanes, across which arithmetic ops won't need to cross at all. Of course loads/stores/shuffles still need to though, making this point somewhat moot.