Hacker News new | ask | show | jobs
by champtar 2350 days ago
One of the best 2h practical course that I had was just write the fastest square matrix multiplication. You could use any language, any algorithm, just no libraries. The target was a 32 core CPU server (this was ~10 years ago). At 5000x5000 all the Java and Python attempts were running out of memory. In C, We tried some openmp, some optimized algorithm, but in the end the best trick was to flip one of the matrix so that memory could be always prefetched. Out of curiosity another student tried GNU Scientific Library, it turned out to be ~100 times faster. My take away was find the right tool for the job!

A fun read on cloud scale vs optimized code is this recent article comparing ClickHouse and ScyllaDB (https://www.altinity.com/blog/2020/1/1/clickhouse-cost-effic...)

4 comments

Yeah, I wouldn't be surprised if the majority of code performing large matrix multiplications these days was written in Python and executed on GPUs by libraries like Tensorflow and PyTorch. With the right abstractions, programmers can be "lazy" and still get great performance.
Matrix multiplication is usually done by a platform-specific BLAS library (BLAS is an API, there are multiple implementations, e.g. Intel MKL, OpenBLAS, cuBLAS). There are some other linear algebra APIs/libraries, but this is what's used the most.

Most of the numerical code that cares about performance for linear algebra uses this API and links an appropriate implementation.

The 'written in Python' you speak of is actually Fortran under the hood.
reminds me of the dude who managed to parse TB with awk instead of whatever spark like product was trendy
Leaving aside the optimisations, I assume you are doing N^3 multiplication, whereas Strassen algorithm with complexity N^2.81 or even Coppersmith–Winograd algorithm with complexity N^2.37 with larger constant is better with 5000x5000 square matrix.
A double array takes O(1) more space in Java vs. C, so running out of memory wouldn't be a problem.
Unfortunately big-O algorithmic analysis doesn't translate cleanly to the real world due to constants and other inefficiencies.
It does in this case. Champtar says it's something about the size of the matrix that makes it run out of space. The overhead of the double array is already paid for.