| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ique 5274 days ago

Even if we don't consider the difference in data structures here, they use wildly different algorithms. Numpy does all the matrix calculations by outsourcing it to BLAS[1] routines that are a mix of C/Assembly, just like the answers detail.

BLAS is not only written in more efficient code, it's different algorithms altogether. BLAS can do a lot of optimizations that brings the total FLOP count to below what's usually considered required for matrix multiplication. (2m*n^2)

[1]: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogram...

3 comments

cdavid 5274 days ago

NumPy dev here.

Note that numpy may not use BLAS (this was done though as to avoid any hard dependencies on 3rd party libraries). I am not sure what you mean by putting the FLOP count below what's required. BLAS will still need O(N^3) operations for a NxN matrix multiplications, whether they are optimized or not. The biggest difference between libraries is usually in clever data organization/passing to use the cpu cache as efficiently as possible (memory throughput is usually the bottleneck until your data don't fit in RAM). You can easily gain one order of magnitude using MKL compared to a naive implementation in C (that you should never, ever do, BTW).

link

tzs 5273 days ago

> BLAS will still need O(N^3) operations for a NxN matrix multiplications, whether they are optimized or not.

Why wouldn't they use an algorithm that is better than O(N^3)?

link

stephencanon 5272 days ago

Two reasons: first, the matrix multiply algorithms with exponents less than three do not have the same numerical stability properties, which can introduce subtle bugs into software that was developed with the expectation of a "usual" O(n^3) multiplication being used. This makes it unsuitable for use in a general-purpose library.

Second, although algorithms with smaller exponents exist, there is more to high-performance than asymptotic complexity. In particular, the constant factors associated with these "fast" algorithms are typically large enough that there is no benefit to using them for "reasonable" matrix sizes.

link

sritchie 5274 days ago

Ah, this is really interesting. I knew that far more work had gone into the code NumPy rides on (of course), but not really much more than that.

I rewrote my clojure code to use JBLAS (jblas.org) and found stunning improvements: https://gist.github.com/264a2756fc657140fdb8

link

pron 5274 days ago

Well, Incanter (http://incanter.org/) uses Parallel Colt (https://sites.google.com/site/piotrwendykier/software/parall...) which, in turn and if I'm not mistaken, uses BLAS, LAPACK etc.

link

srean 5274 days ago

Are you sure that BLAS uses different matrix multiplication algorithm ?

I havent looked at the actual code in a while, but from what I have seen they use the standard vanilla matrix multiplication algorithm whose complexity is no less than O(N^3) (for square matrices).

They achieve the speed by exploiting the deep cache structure of modern machines. Essentialy by blocking (sometimes unrolling) loops and consuming data in a cache aware fashion. Given the latency in accessing the data directly from memory, getting the data off the cache can easily speed things up by 10s if not 100s of folds.

In otherwords, its the same algorithm just implemented more efficiently.

On the other hand is the interesting question if linear algebra can be done fast in a lisp without calling BLAS. I do not have the reference on the top of my head but there were a few papers on ACM where it was shown that lisp can indeed rival fortran speed in linear algebraic computation. I will try and dig it up later. In anycase I suspect that stalin will do a decent job optimizing a matrix multiplication routine written in lisp. I suspect that a part of the slowdown in clojure is also because it runs on jvm whose safety guarantees (bounds checking and no reordering of instructions) can come with a speed hit.

EDIT: here it is http://www.cs.berkeley.edu/~fateman/papers/lispfloat.ps

From the abstract:

  Lisp, one of the oldest higher-level programming languages
  has rarely been used for fast numerical (floating-point)
  computation. We explore the benefits of Common Lisp, an
  emerging new language standard with some excellent
  implementations, for numerical computation.
  We compare it to Fortran in terms of the speed of
  generated code, as well as the structure and convenience
  of the language. There are a surprising number of
  advantages to Lisp, especially in cases where a mixture
  of symbolic and numeric processing is needed

And the conclusion:

  In this article we have asserted, and shown through a
  number of examples, that numerical computing in Lisp need
  not be slow, and that many features of Common
  Lisp are useful in the context of numerical computing. We
  have demonstrated that the speed of compiled Common Lisp
  code, though today somewhat slower than that of the best
  compiled Fortran, could probably be as efficient, and in
  some ways superior. We have suggested ways in which the
  speed of this code might be further improved.

Should be read with http://dl.acm.org/citation.cfm?id=235815.235824 (this is probably pay-walled).

link

microtherion 5274 days ago

BLAS is an API. Just because the reference implementation does things in a particular way does not necessarily mean that that your vendor implementation cannot do things differently (e.g. Apple's numerics team has spent a considerable amount of time tuning BLAS for Apple hardware).

For the particular problem discussed in the original article, there are at least two ways the multiplication A'A could potentially be made faster:

1) The blas _GEMM matrix multiplication routine lets you specify whether input matrix arguments are supposed to be transposed. This gets rid of the explicit transposition, AND in this problem, it lets you compute each element as the dot product of two unit stride vectors, instead of a dot product of an unit stride vector with a nonunit stride vector. For SSE, this makes a huge difference.

2) For the particular case A'A, there is the even more specialized _SYRK routine, which at the very least should be cache friendlier than a naive _GEMM (_GEMM could also figure out that it can use _SYRK for this problem, and presumably it does so in some implementation)

link

srean 5274 days ago

I dont think I claimed that a vendor "cannot do things differently". Not sure where the downvotes came from, so just clarifying.

I would also hazard a guess that the tuning that you mention does not involve a change in the algorithm but are essentially reordering the steps of the algorithm to obtain better caching. I was responding to the parent post which conjectured that BLAS implementations use different algorithms. If you look at your two suggestions, none of them actually change the complexity class of the number of floating operations, but wall clock time oh absolutely.

Although there are matrix multiplication algorithms that have a complexity less than O(N^3) the constants for these are so large enough that the sizes of matrices for which there will be any appreciable benefit are extremely rare to come by.

link

cdavid 5274 days ago

Apple did not spent a considerable amount of time tunning BLAS - they just took Atlas, an widely used open source implementation of blas/lapack (and took care of making sure it does not look like ATLAS at first glance, although a simple look at nm will show it is atlas all the way down).

link

stephencanon 5274 days ago

The OS X BLAS was once ATLAS all the way down (and I see no reason to believe that any effort was made to hide that fact), but that's no longer true. A glance at nm on Lion shows that there's some ATLAS in there, but quite a lot of not-ATLAS too.

link

cdavid 5274 days ago

It is mentioned nowhere in accelerate framework that it uses atlas, and they removed functions like ATL_buildinfo. I doubt it has changed much in the last 5 years as well, and I am not sure what makes you think there is ( significant) non-atlas code in there ? Besides symbols starting with ATL, I see mostly the BLAS/LAPACK API, although I did not try to check very carefully.

link

stephencanon 5273 days ago

Besides the ATL-prefixed symbols, there are are numerous APL-prefixed symbols, as well as many _block_ symbols which indicate extensive use of GCD in the level 3 routines (and which ATLAS doesn't use).

link

stephencanon 5274 days ago

The gemm -> syrk optimization is very common (I would expect most tuned BLAS libraries to take advantage of it -- all that I have inspected do so).

link