by gavinray 1050 days ago
The Java code is impressively written, using newer features like MemorySegment.

Looked at the author and realized it's Alfonso from the Graal team -- makes sense.

I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.


Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and got a ~4x speedup with a single thread, but the speedup fades as you add more threads. I think performance is constrained by memory bandwidth, which is saturated by a small number of threads regardless of vectorization.
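The memory-bandwidth argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the matmul in question is a matrix-vector product over 4-byte float weights (an assumption, not something stated in the thread); the class name, method name, and layer size are hypothetical. Each weight is loaded once and used for one multiply and one add, so the kernel performs roughly half a FLOP per byte read. With so little arithmetic per byte, a few cores may be enough to saturate memory bandwidth, after which faster per-core arithmetic stops helping.

```java
public class ArithmeticIntensity {
    // FLOPs per byte for out[d] = W[d x n] * x[n], assuming 4-byte floats
    // and that every weight is streamed from memory exactly once.
    static double flopsPerByte(long d, long n) {
        double flops = 2.0 * d * n;           // one multiply + one add per weight
        double bytes = 4.0 * (d * n + n + d); // read W and x, write out
        return flops / bytes;
    }

    public static void main(String[] args) {
        // Hypothetical layer size, chosen only for illustration.
        System.out.println(flopsPerByte(4096, 4096));
    }
}
```

The ratio stays just under 0.5 for any large matrix, because the weight traffic dominates both the numerator and the denominator.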
Also, the new virtual threads might be beneficial. I once experimented with the Vector API for matrix multiplication, and the effect was pretty good.
Virtual threads shouldn't help, as the program isn't bottlenecked on I/O or waiting. It's pure computation, so it's all about vectorization here.
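To illustrate the point about virtual threads: a CPU-bound kernel is typically split across about one worker per core, for example with a parallel stream over the output rows, and there is no blocking for virtual threads to hide. This is a minimal sketch under that assumption, not the author's code; `matmul`, its parameters, and the row-major weight layout are hypothetical.

```java
import java.util.stream.IntStream;

public class ParallelMatmulSketch {
    // out[i] = sum over j of w[i*n + j] * x[j]; w is a d x n matrix in row-major order.
    static void matmul(float[] out, float[] x, float[] w, int n, int d) {
        // Rows are independent, so each one can run on a different worker.
        // parallel() runs on the common ForkJoinPool, whose default size is
        // derived from the number of available processors.
        IntStream.range(0, d).parallel().forEach(i -> {
            float sum = 0f;
            int row = i * n;
            for (int j = 0; j < n; j++) {
                sum += w[row + j] * x[j];
            }
            out[i] = sum;
        });
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2 x 3 matrix
        float[] x = {1, 0, -1};
        float[] out = new float[2];
        matmul(out, x, w, 3, 2);
        System.out.println(out[0] + " " + out[1]); // prints "-2.0 -2.0"
    }
}
```

Adding more workers than cores gives a loop like this nothing extra to do, since every worker is already busy computing rather than waiting.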