|
|
|
|
|
by marshallward
808 days ago
|
|
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags. The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill. Parallelization certainly a lot to ask of Fortran 77 source, but there there is little stopping you from adding OpenMP statements to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there. Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article. |
|
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
[1] https://github.com/OpenMathLib/OpenBLAS
[2] https://www.intel.com/content/www/us/en/developer/tools/onea...
[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...
[4]https://en.wikipedia.org/wiki/BLIS_(software)