I believe it's because the C# version was written using rectangular arrays, which require a multiplication on every array access. The Java version uses arrays-of-arrays and hoists the inner array out before accessing it in the inner loop.
C# also has arrays-of-arrays, and could (should) be written in the same manner.
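To make the difference concrete, here is a rough Java sketch of the two access patterns. It is not the benchmark's actual code: the class name `MatMul`, both method names, and the loop order are my own choices for illustration, and the flat-array version only approximates what a rectangular array access has to compute.

```java
// Hypothetical illustration, not the benchmark's source.
final class MatMul {

    // Array-of-arrays version: each row is fetched once and kept in a local,
    // so the inner loop does plain one-dimensional indexing.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int m = b.length;
        int p = b[0].length;
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            double[] ai = a[i];          // hoisted row of a
            double[] ci = c[i];          // hoisted row of the result
            for (int k = 0; k < m; k++) {
                double aik = ai[k];
                double[] bk = b[k];      // hoisted row of b
                for (int j = 0; j < p; j++) {
                    ci[j] += aik * bk[j];
                }
            }
        }
        return c;
    }

    // Flat row-major version: every access computes row * cols + col,
    // which is roughly the index arithmetic a rectangular array implies.
    static double[] multiplyFlat(double[] a, double[] b, int n, int m, int p) {
        double[] c = new double[n * p];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < m; k++) {
                for (int j = 0; j < p; j++) {
                    c[i * p + j] += a[i * m + k] * b[k * p + j];
                }
            }
        }
        return c;
    }
}
```

Both methods compute the same product. The only difference is how much index arithmetic is left in the innermost loop, and a JIT may or may not remove it on its own.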
I've just done this and it has been merged. The benchmarks table and image haven't been updated yet, but this should bring the C# result to ~2s instead of 4.67s.
Unfortunately it doesn't. The newest and hottest way to do this is either to use the bespoke matmul from System.Numerics.Tensors or at least to use Vector<T> for SIMD (which is trivial, and not the "last mile" optimization it often seems to be).