Hacker News new | ask | show | jobs
by microtherion 5274 days ago
BLAS is an API. Just because the reference implementation does things in a particular way does not necessarily mean that that your vendor implementation cannot do things differently (e.g. Apple's numerics team has spent a considerable amount of time tuning BLAS for Apple hardware).

For the particular problem discussed in the original article, there are at least two ways the multiplication A'A could potentially be made faster:

1) The blas _GEMM matrix multiplication routine lets you specify whether input matrix arguments are supposed to be transposed. This gets rid of the explicit transposition, AND in this problem, it lets you compute each element as the dot product of two unit stride vectors, instead of a dot product of an unit stride vector with a nonunit stride vector. For SSE, this makes a huge difference.

2) For the particular case A'A, there is the even more specialized _SYRK routine, which at the very least should be cache friendlier than a naive _GEMM (_GEMM could also figure out that it can use _SYRK for this problem, and presumably it does so in some implementation)

3 comments

I dont think I claimed that a vendor "cannot do things differently". Not sure where the downvotes came from, so just clarifying.

I would also hazard a guess that the tuning that you mention does not involve a change in the algorithm but are essentially reordering the steps of the algorithm to obtain better caching. I was responding to the parent post which conjectured that BLAS implementations use different algorithms. If you look at your two suggestions, none of them actually change the complexity class of the number of floating operations, but wall clock time oh absolutely.

Although there are matrix multiplication algorithms that have a complexity less than O(N^3) the constants for these are so large enough that the sizes of matrices for which there will be any appreciable benefit are extremely rare to come by.

Apple did not spent a considerable amount of time tunning BLAS - they just took Atlas, an widely used open source implementation of blas/lapack (and took care of making sure it does not look like ATLAS at first glance, although a simple look at nm will show it is atlas all the way down).
The OS X BLAS was once ATLAS all the way down (and I see no reason to believe that any effort was made to hide that fact), but that's no longer true. A glance at nm on Lion shows that there's some ATLAS in there, but quite a lot of not-ATLAS too.
It is mentioned nowhere in accelerate framework that it uses atlas, and they removed functions like ATL_buildinfo. I doubt it has changed much in the last 5 years as well, and I am not sure what makes you think there is ( significant) non-atlas code in there ? Besides symbols starting with ATL, I see mostly the BLAS/LAPACK API, although I did not try to check very carefully.
Besides the ATL-prefixed symbols, there are are numerous APL-prefixed symbols, as well as many _block_ symbols which indicate extensive use of GCD in the level 3 routines (and which ATLAS doesn't use).
The gemm -> syrk optimization is very common (I would expect most tuned BLAS libraries to take advantage of it -- all that I have inspected do so).