by KolenCh
882 days ago
Try implementing a simple matmul and you’ll understand. Start with a naive version as a baseline, then apply a few standard tricks (loop reordering, cache blocking, vectorization) to speed it up, and compare the result to a standard BLAS call. You’ll find that even with those tricks it is nowhere close to off-the-shelf BLAS libraries. From this exercise alone, knowing the tricks you already used, you can see how far from embarrassingly parallel this task is. (Frankly, if it were truly embarrassingly parallel, then `#pragma omp parallel for` should already bring you close to the best possible performance.) I don’t think your prof got it wrong; it is just that you misunderstood them.
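To make the exercise concrete, here is a minimal sketch of the first two steps in C: a naive baseline and a version using two of the usual tricks (cache blocking and an i-k-j loop order). The function names `matmul_naive` and `matmul_tiled` and the block-size parameter `bs` are my own, not from any library, and I'm assuming square row-major matrices of doubles. The BLAS routine you would compare against is `dgemm` (for example `cblas_dgemm` if you have a CBLAS interface installed).

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: C = A * B for row-major n x n matrices, textbook i-j-k order.
 * The inner loop walks down a column of B with stride n, which is
 * unfriendly to the cache. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Hypothetical tuned variant: work on bs x bs blocks so the working set
 * can stay in cache, and use i-k-j order inside a block so the innermost
 * loop streams over contiguous rows of B and C. */
void matmul_tiled(size_t n, size_t bs, const double *A, const double *B,
                  double *C)
{
    assert(bs > 0);

    /* C is accumulated into, so it has to start at zero. */
    for (size_t idx = 0; idx < n * n; idx++)
        C[idx] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = (ii + bs < n) ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = (kk + bs < n) ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = (jj + bs < n) ? jj + bs : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Time both on a matrix large enough to spill out of cache, then time the same product through your BLAS. The block size is something you would have to tune per machine, which is already a hint that the problem is not just "split the loop across threads".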