by uep 1961 days ago
I have a simple C++ raytracer I wrote by going through Ray Tracing in One Weekend. I have not even made an attempt to optimize it. I really only made it parallel by splitting it up into tiles.
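
For readers unfamiliar with tile-based parallelism, here is a minimal sketch of the idea, not the code described above. It assumes `std::thread` workers pulling tile indices from an atomic counter; `shade_pixel` and `render_tiled` are hypothetical names, and `shade_pixel` is a stand-in for the real per-pixel ray tracing work.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-pixel ray tracing work.
inline float shade_pixel(int x, int y) {
    return static_cast<float>(x + y);
}

// Render a width x height image by handing square tiles to worker threads.
// Each tile is written by exactly one thread, so the pixel writes need no locks.
std::vector<float> render_tiled(int width, int height, int tile_size,
                                unsigned num_threads) {
    std::vector<float> image(static_cast<std::size_t>(width) * height);
    const int tiles_x = (width + tile_size - 1) / tile_size;   // round up
    const int tiles_y = (height + tile_size - 1) / tile_size;  // round up
    const int tile_count = tiles_x * tiles_y;
    std::atomic<int> next_tile{0};

    auto worker = [&] {
        for (;;) {
            // Claim the next unrendered tile; stop when none are left.
            const int t = next_tile.fetch_add(1, std::memory_order_relaxed);
            if (t >= tile_count) break;
            const int x0 = (t % tiles_x) * tile_size;
            const int y0 = (t / tiles_x) * tile_size;
            const int x1 = std::min(x0 + tile_size, width);
            const int y1 = std::min(y0 + tile_size, height);
            for (int y = y0; y < y1; ++y)
                for (int x = x0; x < x1; ++x)
                    image[static_cast<std::size_t>(y) * width + x] =
                        shade_pixel(x, y);
        }
    };

    std::vector<std::thread> pool;
    const unsigned n = std::max(1u, num_threads);
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return image;
}
```

Pulling tiles from a shared counter, rather than assigning a fixed set per thread, keeps the cores busy when some tiles are much more expensive than others, which is common in ray tracing.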

Clang 10 was able to automatically vectorize the code, so it performs >2x as fast as GCC 8.3. To be fair to GCC, I'm using my distro's GCC, but I built a newer Clang for C++ coroutine support.

Are you sure? Modern clang and gcc both have auto-vectorizers. clang's is enabled by default.[1] gcc requires '-ftree-vectorize' (or -O3, which implies it).[2] For my use case, I've seen the most improvements with clang + OpenMP + Polly, requiring code changes along with hinting. Good news if your analysis is correct.

As for the article, I'm surprised Cache and Meshlets are 5% slower in 11 than 2.7. It would be useful to find out what caused this regression.

[1] https://llvm.org/docs/Vectorizers.html

[2] https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Am I sure about what? If it is auto-vectorizing? Yes. If the performance difference at O2 for both compilers is that dramatic? Yes. If the vectorization is the ultimate difference in the performance? No, not really.

I looked at the disassembly with objdump. I tend to build with both clang and GCC regularly; for some reason I like comparing them. Since I'm sending many rays and bounces, a 50% reduction in time is very noticeable, so I looked at the generated code. I mentioned the GCC version because it is slightly unfair to compare a very new clang to GCC from a few years back. The GCC output has some vectorization as well, but clang seems to generate smaller code with more vectorization. It would be interesting to compare them side-by-side on godbolt, but I'd have to cut-and-paste a bunch of files to do so, and it's not a priority at the moment.
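
Besides reading the disassembly, both compilers can report which loops they vectorized. The sketch below is a hypothetical kernel (`scale_add` is not from my raytracer) that either compiler would typically be able to vectorize; the flags in the comments are the standard remark options, and `kernel.cpp` is just a placeholder file name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical kernel for checking the vectorizer's output. Try, for example:
//   clang++ -O2 -Rpass=loop-vectorize -c kernel.cpp
//   g++ -O3 -fopt-info-vec-optimized -c kernel.cpp
// Each compiler prints a remark for the loops it managed to vectorize.
void scale_add(std::vector<float>& out, const std::vector<float>& a,
               const std::vector<float>& b, float s) {
    // Clamp to the shortest input so the loop never reads out of bounds.
    const std::size_t n = std::min({out.size(), a.size(), b.size()});
    float* o = out.data();
    const float* pa = a.data();
    const float* pb = b.data();
    // Simple, branch-free loop body with unit stride: a good candidate
    // for auto-vectorization.
    for (std::size_t i = 0; i < n; ++i) {
        o[i] = pa[i] * s + pb[i];
    }
}
```

Whether a given loop is actually vectorized depends on the compiler version, target, and flags, so the remarks are worth checking rather than assuming.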

Maybe I should have responded to another comment here. The intention of my previous comment was to bolster the idea that more typical, naive, less-optimized code might benefit more than already-optimized code like that in the article. 3D math in general is obviously a domain that can benefit from vectorization more than most.

Another fun find was that sharing the PRNG state among threads destroyed performance. I have other higher-priority side projects, so I haven't had a chance to investigate why yet. It could be something like the cache line bouncing between cores (I wouldn't be surprised if the PRNG was the hottest code in the whole program), or a cascading effect on the generated code. A lot of my code is visible to the compiler for the ray tracing hot path, so it's also possible it broke inlining or some other compiler optimizations.
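
One common way to avoid sharing PRNG state is to give each thread its own generator. This is a minimal sketch under that assumption, not necessarily the code from my raytracer; `random_unit` is a hypothetical helper name, and the choice of `std::mt19937` is only illustrative.

```cpp
#include <random>

// Returns a pseudo-random float from a uniform distribution over [0, 1).
// The generator and distribution are thread_local, so every thread has
// its own copy and no PRNG state is written by more than one core.
inline float random_unit() {
    // Seeded once per thread, on that thread's first call.
    thread_local std::mt19937 rng{std::random_device{}()};
    thread_local std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    return dist(rng);
}
```

Because each thread only touches its own state, there is no cache line for the cores to fight over, and the compiler is free to inline the call in the hot path. Seeding from `std::random_device` makes runs non-reproducible; a fixed per-thread seed would be needed for deterministic output.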