
Am I sure about what? If it is auto-vectorizing? Yes. If the performance difference at O2 for both compilers is that dramatic? Yes. If the vectorization is the ultimate difference in the performance? No, not really.

I looked at the disassembly with objdump. I regularly build with both clang and GCC; for some reason I like comparing them. Since I'm sending many rays and bounces, a 50% reduction in time is very noticeable, so I looked at the generated code. I mentioned the GCC version because it is slightly unfair to compare a very new clang to a GCC from a few years back. The GCC output has some vectorization as well, but clang seems to generate smaller code with more vectorization. It would be interesting to compare them side by side on godbolt, but I'd have to cut and paste a bunch of files to do so, and it's not a priority at the moment.

Maybe I should have responded to another comment here. The intention of my previous comment was to bolster the idea that more typical, naive, less-optimized code might benefit more than already-optimized code like that in the article. 3D math in general is obviously a domain that can benefit from vectorization more than most.
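To illustrate what I mean by naive code that a compiler can vectorize on its own, here is a minimal sketch. It is not my actual ray tracer code: `Vec3Batch` and `point_at` are hypothetical names, and the structure-of-arrays layout is an assumption made for the example. It computes `origin + t * dir` for a batch of rays with plain scalar loops and no intrinsics.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout for a batch of 3D vectors.
// Keeping x, y and z in separate contiguous arrays gives the compiler
// simple unit-stride loops to work with.
struct Vec3Batch {
    std::vector<float> x, y, z;
    explicit Vec3Batch(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Computes out[i] = origin[i] + t[i] * dir[i] for every ray in the batch.
// All inputs are assumed to have the same length as `out`.
void point_at(const Vec3Batch& origin, const Vec3Batch& dir,
              const std::vector<float>& t, Vec3Batch& out) {
    const std::size_t n = out.size();
    // Each iteration is independent of the others, which is the property
    // an auto-vectorizer looks for.
    for (std::size_t i = 0; i < n; ++i) {
        out.x[i] = origin.x[i] + t[i] * dir.x[i];
        out.y[i] = origin.y[i] + t[i] * dir.y[i];
        out.z[i] = origin.z[i] + t[i] * dir.z[i];
    }
}
```

Building something like this at `-O2` with each compiler and running `objdump -d` on the result is one way to see whether packed SIMD instructions show up in the loop. What you get will depend on the compiler version and target flags.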

Another fun find was that sharing the PRNG state among threads destroyed performance. I have other, higher-priority side projects, so I haven't had a chance to investigate why yet. It could be something like the cache line holding the state bouncing between cores (I wouldn't be surprised if the PRNG was the hottest code in the whole program), or a cascading effect on the generated code. A lot of my code is visible to the compiler for the ray tracing hot path, so it's also possible that the shared state broke inlining or some other compiler optimization.
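For reference, the usual way to avoid shared PRNG state is to give each thread its own generator. The sketch below is an assumption-laden illustration, not my real code: `XorShift64` is a stand-in generator (Marsaglia's xorshift64 with the 13/7/17 shifts), `sample_per_thread` is a hypothetical helper, and the seeding scheme is deliberately simplistic.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in generator for the example (xorshift64). The PRNG in the real
// program may be something else entirely.
struct XorShift64 {
    std::uint64_t state;
    // A zero state would stay zero forever, so substitute a nonzero constant.
    explicit XorShift64(std::uint64_t seed)
        : state(seed != 0 ? seed : 0x9E3779B97F4A7C15ULL) {}

    std::uint64_t next() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }

    // Uniform double in [0, 1), built from the top 53 bits.
    double next_unit() {
        return static_cast<double>(next() >> 11) / 9007199254740992.0;
    }
};

// Hypothetical helper: every worker owns a generator that lives on its own
// stack, so no PRNG state is shared between cores. Returns one sum per thread.
std::vector<double> sample_per_thread(unsigned num_threads,
                                      unsigned samples_per_thread) {
    std::vector<double> sums(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&sums, t, samples_per_thread] {
            XorShift64 rng(t + 1);  // simplistic per-thread seed (assumption)
            double sum = 0.0;
            for (unsigned i = 0; i < samples_per_thread; ++i) {
                sum += rng.next_unit();
            }
            sums[t] = sum;  // one write per thread, after the hot loop
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return sums;
}
```

Because the generator is a local variable in each worker, the compiler is free to keep its state in a register and inline `next()`, which it generally cannot do for state that other threads might touch. Whether that explains the slowdown I saw is still a guess until I actually profile it.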