by moonchild
1386 days ago
IIRC I measured something like a 10-50% performance difference (I don't remember exactly, but it was somewhere in that range) vs. a reasonably well-regarded BLAS implementation. This was for dgemm specifically; I don't know if the story changes for smaller floats.
I've been paying attention to what Apple has been pushing with their M1/M2 chips, and I'm pretty tempted to try it out, but unless these features are documented and supported, I can't feel comfortable writing programs that rely on them.