| HN Mirror


by brrrrrm 1327 days ago

avoiding writes to memory and reducing the number of loops (although not FLOPs)

    for j in range(10):
      c[j] = a[j] + b[j]
    for j in range(10):
      d[j] = c[j] * 2

becomes

    for j in range(10):
      d[j] = (a[j] + b[j]) * 2


thrtythreeforty 1327 days ago

Or, better, identifying that the machine has a primitive that is better than doing each op individually. For example, a multiply-accumulate instruction vs a multiply and separate accumulate. The source code still says "a*b+c", the compiler is just expected to infer the MAC instruction.
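A toy sketch of that kind of inference, assuming a tiny tuple-based expression tree. The `"mac"` node and the `select_mac` pass are hypothetical names for illustration, not any real compiler's API.

```python
# Expressions are variable names (str) or tuples: ("add", x, y), ("mul", x, y).
# "mac" is a hypothetical fused node standing in for a hardware
# multiply-accumulate instruction.

def select_mac(expr):
    """Rewrite add(mul(a, b), c) or add(c, mul(a, b)) into mac(a, b, c)."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [select_mac(arg) for arg in args]  # rewrite sub-expressions first
    if op == "add":
        lhs, rhs = args
        if isinstance(lhs, tuple) and lhs[0] == "mul":
            return ("mac", lhs[1], lhs[2], rhs)
        if isinstance(rhs, tuple) and rhs[0] == "mul":
            return ("mac", rhs[1], rhs[2], lhs)
    return (op, *args)

def evaluate(expr, env):
    """Reference interpreter for both the original and the rewritten tree."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    vals = [evaluate(arg, env) for arg in args]
    if op == "add":
        return vals[0] + vals[1]
    if op == "mul":
        return vals[0] * vals[1]
    if op == "mac":
        return vals[0] * vals[1] + vals[2]
    raise ValueError(f"unknown op: {op}")

source = ("add", ("mul", "a", "b"), "c")  # the source still says a*b+c
lowered = select_mac(source)
env = {"a": 3, "b": 4, "c": 5}
```

With integers the two trees evaluate identically. On real hardware a fused multiply-add typically rounds once instead of twice, so floating-point results can differ in the last bit; this sketch does not model that.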


brrrrrm 1327 days ago

Yep! This is an assumed optimization when it comes to modern linear algebra compilers. New primitives go way beyond FMAs: full matrix multiplies on NVIDIA/Intel hardware and outer-product accumulates on Apple silicon. It’s also expected that these are used nearly optimally (or you’ve got a bug).
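For anyone unfamiliar with the outer-product form, here is a rough sketch of how a matrix multiply decomposes into outer-product accumulates. The function names are invented for illustration and do not model any vendor's actual instruction.

```python
def outer_product_accumulate(acc, col, row):
    """acc[i][j] += col[i] * row[j], updating acc in place."""
    for i, x in enumerate(col):
        for j, y in enumerate(row):
            acc[i][j] += x * y

def matmul_via_outer_products(A, B):
    """Compute A @ B as a sum of k outer products, one per inner index."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p in range(k):
        col = [A[i][p] for i in range(m)]  # p-th column of A
        outer_product_accumulate(C, col, B[p])  # p-th row of B
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_via_outer_products(A, B)
```

Each call to `outer_product_accumulate` updates the whole output tile at once, which is the shape of work such a primitive would take over from the scalar loops.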


thrtythreeforty 1326 days ago

I am extremely familiar with how far these primitives go, ha. I develop kernels professionally for AWS ML accelerators.
