Hacker News
by howeman 3192 days ago
The algorithms are (basically) equivalent: they are translations of the reference Fortran code, though using row-major rather than column-major storage. As far as I know there are no major differences in the answers, though for extremely ill-conditioned matrices (condition numbers around 1e14) you shouldn't expect consistent answers across any implementation.
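To make the row-major vs column-major point concrete, here is a minimal Go sketch of the two indexing conventions. The helper names `colMajorIndex` and `rowMajorIndex` are hypothetical and not part of any library. The sketch only shows that the same element lives at a different flat offset depending on layout, which is what a row-major translation of column-major Fortran code has to account for.

```go
package main

import "fmt"

// colMajorIndex returns the flat offset of element (i, j) in a column-major
// matrix with leading dimension lda (the Fortran convention).
func colMajorIndex(i, j, lda int) int {
	return i + j*lda
}

// rowMajorIndex returns the flat offset of element (i, j) in a row-major
// matrix whose rows are stride elements apart (the C-style convention).
func rowMajorIndex(i, j, stride int) int {
	return i*stride + j
}

func main() {
	// The 2x3 matrix
	//   [1 2 3]
	//   [4 5 6]
	// stored both ways.
	colMajor := []float64{1, 4, 2, 5, 3, 6} // columns contiguous, lda = 2
	rowMajor := []float64{1, 2, 3, 4, 5, 6} // rows contiguous, stride = 3

	for i := 0; i < 2; i++ {
		for j := 0; j < 3; j++ {
			a := colMajor[colMajorIndex(i, j, 2)]
			b := rowMajor[rowMajorIndex(i, j, 3)]
			// Reading the column-major buffer as a row-major 3x2 matrix
			// (stride 2) yields the transpose: element (j, i).
			t := colMajor[rowMajorIndex(j, i, 2)]
			fmt.Printf("(%d,%d): col-major %v, row-major %v, transpose view %v\n", i, j, a, b, t)
		}
	}
}
```

A useful consequence: a column-major m x n buffer, read as row-major with stride m, is exactly the n x m transpose, so a row-major translation can often be handled by swapping the roles of the indices rather than copying data.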

The performance story is complex. Typically we're the same speed as the native libraries on small matrices (and pure Go is faster once you include the cgo call overhead). We currently have significant speed penalties on large matrices (300x300 or so), but Kunde21 is working on assembly kernels for the BLAS functions to close that gap.
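As a rough illustration of why large matrices are where a straightforward implementation falls behind, here is a generic cache-blocking sketch in Go. It is not the library's code or any particular BLAS kernel; the function names and the block size `bs` are hypothetical. Tuned BLAS libraries combine blocking like this with hand-written SIMD kernels.

```go
package main

import "fmt"

// minInt returns the smaller of a and b.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// matMulNaive sets c = a*b for n x n row-major matrices, using the i-k-j loop
// order so the innermost loop walks rows of b and c contiguously.
func matMulNaive(n int, a, b, c []float64) {
	for i := range c[:n*n] {
		c[i] = 0
	}
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i*n+k]
			for j := 0; j < n; j++ {
				c[i*n+j] += aik * b[k*n+j]
			}
		}
	}
}

// matMulBlocked computes the same product tile by tile. Each bs x bs tile of
// a, b and c is reused many times while it is (ideally) still in cache; bs is
// a tuning parameter, not a recommended value.
func matMulBlocked(n, bs int, a, b, c []float64) {
	for i := range c[:n*n] {
		c[i] = 0
	}
	for ii := 0; ii < n; ii += bs {
		iMax := minInt(ii+bs, n)
		for kk := 0; kk < n; kk += bs {
			kMax := minInt(kk+bs, n)
			for jj := 0; jj < n; jj += bs {
				jMax := minInt(jj+bs, n)
				for i := ii; i < iMax; i++ {
					for k := kk; k < kMax; k++ {
						aik := a[i*n+k]
						for j := jj; j < jMax; j++ {
							c[i*n+j] += aik * b[k*n+j]
						}
					}
				}
			}
		}
	}
}

func main() {
	const n, bs = 5, 2
	a := make([]float64, n*n)
	b := make([]float64, n*n)
	for i := range a {
		a[i] = float64(i % 7)
		b[i] = float64((3 * i) % 5)
	}
	c1 := make([]float64, n*n)
	c2 := make([]float64, n*n)
	matMulNaive(n, a, b, c1)
	matMulBlocked(n, bs, a, b, c2)

	same := true
	for i := range c1 {
		if c1[i] != c2[i] {
			same = false
		}
	}
	fmt.Println("blocked matches naive:", same)
}
```

Both functions accumulate each c[i][j] over k in ascending order, so they produce identical results on the same inputs; only the memory access pattern differs, and that pattern is what matters once the matrices no longer fit in cache.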


I'm surprised your performance is anywhere near that of standard BLAS implementations. The Go compiler doesn't support explicit SIMD or auto-vectorization, so there's a big performance gain just sitting there.

For small vectors and matrices, the cgo overhead swamps the assembly speedups. For large vectors, cache misses dominate and the assembly doesn't matter as much. It does matter significantly for medium-sized vectors and for large matrices; for those cases we provide cgo wrappers and are working on SIMD kernels.
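To give a feel for what a SIMD dot-product kernel does differently from a plain loop, here is a pure-Go sketch that processes four elements per iteration into independent accumulators, roughly mirroring four SIMD lanes. The function names are hypothetical, and this is not an assembly kernel: it only illustrates the loop structure such a kernel uses.

```go
package main

import "fmt"

// dotNaive is the straightforward reference loop.
func dotNaive(x, y []float64) float64 {
	if len(x) != len(y) {
		panic("dotNaive: length mismatch")
	}
	var s float64
	for i := range x {
		s += x[i] * y[i]
	}
	return s
}

// dotUnrolled handles four elements per iteration with four independent
// accumulators, similar to how a SIMD kernel keeps several lanes busy.
// It is still ordinary scalar Go code.
func dotUnrolled(x, y []float64) float64 {
	if len(x) != len(y) {
		panic("dotUnrolled: length mismatch")
	}
	n := len(x)
	var s0, s1, s2, s3 float64
	i := 0
	for ; i+4 <= n; i += 4 {
		s0 += x[i] * y[i]
		s1 += x[i+1] * y[i+1]
		s2 += x[i+2] * y[i+2]
		s3 += x[i+3] * y[i+3]
	}
	// Handle the remaining 0-3 elements.
	for ; i < n; i++ {
		s0 += x[i] * y[i]
	}
	return (s0 + s1) + (s2 + s3)
}

func main() {
	x := []float64{1, 2, 3, 4, 5, 6, 7}
	y := []float64{2, 2, 2, 2, 2, 2, 2}
	fmt.Println(dotNaive(x, y), dotUnrolled(x, y)) // prints "56 56"
}
```

Because the unrolled version adds the products in a different order, its result can differ from the naive loop in the last bits for general floating-point inputs; the values in main are small integers, so both sums are exact. This kind of reassociation is one reason different implementations need not return bit-identical answers.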