If I remember correctly, the most efficient algorithms have a ridiculously large constant factor, to the point where you don't have enough RAM to even store the matrix, let alone do anything with it.
That's not what GP is talking about. Those algorithms are the sub-cubic ones, the ones that are O(x^2.something). GP is talking about an algorithm that is still O(n^3), but is not just a straight translation of the matrix multiplication formula into code, but rather does things in a way that is more friendly to the CPU cache, resulting in significant speed gains.