by openasocket 3195 days ago
I'm surprised your performance is anywhere near that of standard BLAS implementations. The Go compiler supports neither explicit SIMD nor auto-vectorization, so there's a big performance gain just sitting there.

For small vectors and matrices, the cgo call overhead swamps the assembly speedups. For large vectors, cache misses dominate, so the assembly doesn't matter as much. Assembly does matter significantly for medium-sized vectors and large matrices. For those cases we provide cgo wrappers and are working on SIMD kernels.
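
To make the three regimes concrete, here is a rough sketch of size-based dispatch for a dot product. It is only an illustration, not how the library actually does it: the cutoffs `smallCutoff` and `largeCutoff` are made-up numbers that would have to be measured per machine, and `dotAccelerated` is a pure-Go unrolled loop standing in for a cgo-backed or SIMD kernel so the example runs without a C toolchain.

```go
package main

import "fmt"

// Hypothetical crossover points. Real values depend on the hardware and on
// the cost of the foreign call, and would have to be found by benchmarking.
const (
	smallCutoff = 64      // below this, call overhead is assumed to outweigh a faster kernel
	largeCutoff = 1 << 20 // above this, cache misses are assumed to dominate
)

// dotGeneric is the plain scalar loop.
func dotGeneric(x, y []float64) float64 {
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// dotAccelerated is a stand-in for a cgo-backed BLAS call or a SIMD kernel.
// It is only a 4-way unrolled pure-Go loop with independent accumulators.
func dotAccelerated(x, y []float64) float64 {
	var s0, s1, s2, s3 float64
	n := len(x)
	i := 0
	for ; i+4 <= n; i += 4 {
		s0 += x[i] * y[i]
		s1 += x[i+1] * y[i+1]
		s2 += x[i+2] * y[i+2]
		s3 += x[i+3] * y[i+3]
	}
	// Handle the remaining 0 to 3 elements.
	for ; i < n; i++ {
		s0 += x[i] * y[i]
	}
	return (s0 + s1) + (s2 + s3)
}

// Dot uses the generic loop for small and very large inputs and the
// accelerated kernel only for the medium range in between.
func Dot(x, y []float64) float64 {
	if len(x) != len(y) {
		panic("dot: length mismatch")
	}
	n := len(x)
	if n < smallCutoff || n > largeCutoff {
		return dotGeneric(x, y)
	}
	return dotAccelerated(x, y)
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{4, 5, 6}
	fmt.Println(Dot(x, y)) // prints 32
}
```

With the real thing, the middle branch would be the cgo wrapper, and the right cutoffs are something you'd find with `go test -bench` on the target machine rather than guess.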