Doesn't your PR monomorphize the function for every distinct value of N? I realize it's simpler: it keeps the structure, simplifies array allocation, and elides bounds checks. But it inflates the generated code size, and the matrix size can no longer be changed at runtime, which doesn't really match the C version.
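To make the trade-off concrete, here is a rough sketch of the two shapes I have in mind. This is not the PR's code: I'm assuming a plain matrix multiply as the kernel, and the names `matmul_const` and `matmul_dyn` are made up for illustration.

```rust
/// Const-generic version: `N` is a compile-time constant, so the compiler
/// emits a separate copy of this function for every distinct `N` it is
/// called with. Arrays live on the stack and their size is known statically.
pub fn matmul_const<const N: usize>(a: &[[f64; N]; N], b: &[[f64; N]; N]) -> [[f64; N]; N] {
    let mut c = [[0.0; N]; N];
    for i in 0..N {
        for k in 0..N {
            let aik = a[i][k];
            for j in 0..N {
                c[i][j] += aik * b[k][j];
            }
        }
    }
    c
}

/// Runtime-sized version: a single copy of the code, with `n` chosen at
/// runtime and matrices stored row-major in flat slices. Each index is
/// bounds-checked unless the optimizer can prove it is in range.
pub fn matmul_dyn(n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}
```

Calling `matmul_const::<64>` and `matmul_const::<128>` in the same program gives two compiled instances, whereas `matmul_dyn` stays a single function that takes the size as an argument, which is closer to what the C version does.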
Edit: I tried writing an iterator-based version to elide the bounds checks, but I had to resort to unsafe, and it's barely 50% faster than the original Rust version (still not as fast as C): https://gist.github.com/anisse/6b580628206293ef242faa7db6219...
Edit 2: updated; my Rust iterator version is now ~equivalent to C, with no unsafe.
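For anyone wondering what "iterator-based with no unsafe" can look like, here is a minimal sketch. It is not the code from my gist or from the repo: it again assumes a plain row-major matrix multiply, and `matmul_iter` is a hypothetical name. The idea is to walk rows with `chunks_exact` and `zip` instead of indexing, so there are no per-element index expressions left to check.

```rust
/// Iterator-based, runtime-sized matrix multiply: `c = a * b`.
/// All three matrices are `n x n`, stored row-major in flat slices.
pub fn matmul_iter(n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    // `chunks_exact` panics on a chunk size of 0, so reject it up front.
    assert!(n > 0);
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    assert_eq!(c.len(), n * n);

    c.fill(0.0);
    // Pair each output row with the matching row of `a`.
    for (c_row, a_row) in c.chunks_exact_mut(n).zip(a.chunks_exact(n)) {
        // Pair each element a[i][k] with row k of `b`.
        for (&aik, b_row) in a_row.iter().zip(b.chunks_exact(n)) {
            // Accumulate aik * b[k][j] into c[i][j] for every j.
            for (cij, &bkj) in c_row.iter_mut().zip(b_row) {
                *cij += aik * bkj;
            }
        }
    }
}
```

The size stays a runtime parameter, as in C, and the only explicit checks are the length assertions at the top of the function.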
Edit 3: too late, the repo has been updated with another iterator-based version that is just as fast.