Hacker News new | ask | show | jobs
by peterhj 2593 days ago
If you have an assembler or C compiler you can implement matrix multiplication (GEMM) which usually does most of the heavy lifting in your neural net. Now you correctly alluded that it may not be simple to efficiently implement GEMM but if you have a simple architecture without a complex memory hierarchy then using whatever SIMD facilities are available and some standard tricks will get you in ballpark of peak FLOP/s.

Or, just download a fast BLAS from your hardware vendor...