by abelsson
4636 days ago
I'll grant you that: AVX is pretty uncommon. I originally wrote it with SSE2 (there's a working version in the next-to-last commit on the GitHub repo), but rewrote it using AVX because, well, I hadn't used it before. But I wouldn't say that writing media and signal processing inner loops with SIMD intrinsics is uncommon. The style of optimization I illustrate is pretty common, perhaps minus the AVX code path. Most widely used video/image processing, ray tracing and other compute-bound libraries will probably be SIMD-optimized in some fashion (probably with different code paths for different processors). You gain 1-8x performance, which is pretty significant; it's the same order of magnitude of speedup as threading your program. I have yet to see anyone truly and systematically trusting automatic vectorization, but perhaps there are libs out there I've missed. Anyone know of some?
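To make the "different code paths" point concrete, here is a minimal sketch of an inner loop with an SSE2 path and a scalar fallback. It is not taken from the repo under discussion: the kernel `madd` and the `HAVE_SSE2` macro are hypothetical names, and the path is chosen at compile time for brevity, whereas libraries often dispatch at runtime based on CPU feature detection.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float ops)
#define HAVE_SSE2 1     // hypothetical macro, for this sketch only
#else
#define HAVE_SSE2 0
#endif

// Hypothetical kernel: out[i] = a[i] * b[i] + c[i] for i in [0, n).
void madd(const float* a, const float* b, const float* c,
          float* out, std::size_t n) {
    assert(a && b && c && out);
    std::size_t i = 0;
#if HAVE_SSE2
    // SIMD path: process four floats per iteration with unaligned loads.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
#endif
    // Scalar path: handles the tail, or everything when SSE2 is unavailable.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

An AVX path would follow the same shape with `__m256` and eight floats per iteration, which is where the upper end of the speedup range comes from.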
|
I made many of the same transformations as you did: switching the object's data to a structure of arrays, splitting the computation of the normal out of the loop, etc. (even an int hit = -1). My goal was to coax Intel's compiler into autovectorizing that loop without directly using vector intrinsics. I succeeded, but the result turned out to be noticeably slower than just compiling kid0man's with -fast. Part of that, I suspect, is that it generated suboptimal code for the reduction over the minimum, where a human programmer would have used a movemask as you did.
That said, I'm fairly curious to see how it would perform with the kernel compiled via ispc [1].
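For readers who haven't seen the movemask trick for the minimum reduction mentioned above, a rough sketch follows. It is an illustration under assumptions, not the code from either implementation: `nearest_hit` and `HAVE_SSE2` are hypothetical names, and the hit distances are assumed to already sit in a flat array `t` (as they would with a structure-of-arrays layout), with non-positive values meaning "no hit".

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1     // hypothetical macro, for this sketch only
#else
#define HAVE_SSE2 0
#endif

// Returns the index of the smallest t[i] with t[i] > 0, or -1 if there is none.
// On ties, the lowest index wins.
int nearest_hit(const float* t, std::size_t n) {
    assert(t || n == 0);
    int hit = -1;
    float best = std::numeric_limits<float>::infinity();
    std::size_t i = 0;
#if HAVE_SSE2
    const __m128 zero = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 vt = _mm_loadu_ps(t + i);
        // Lanes that are valid hits (t > 0) and closer than the current best.
        __m128 closer = _mm_and_ps(_mm_cmpgt_ps(vt, zero),
                                   _mm_cmplt_ps(vt, _mm_set1_ps(best)));
        // Collapse the four lane masks into the low 4 bits of an int.
        int mask = _mm_movemask_ps(closer);
        if (mask == 0) continue;  // common case: no lane improves on best
        // Rare case: fall back to scalar code for the flagged lanes only.
        for (int lane = 0; lane < 4; ++lane) {
            if (((mask >> lane) & 1) && t[i + lane] < best) {
                best = t[i + lane];
                hit = static_cast<int>(i + lane);
            }
        }
    }
#endif
    // Scalar tail (or the whole loop when SSE2 is unavailable).
    for (; i < n; ++i) {
        if (t[i] > 0.0f && t[i] < best) {
            best = t[i];
            hit = static_cast<int>(i);
        }
    }
    return hit;
}
```

The point of the mask test is that most groups of four contain nothing closer than the current best, so the branch is cheap and well predicted, and the scalar update runs only occasionally.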
Regarding direct use of SIMD intrinsics in inner loops, I have to agree with you that it's still reasonably common for this type of thing. I've certainly done it before in ray tracing contexts [2], and I've seen many others do it as well, e.g. [3] and [4]. Autovectorization and things like Intel's array notation extension [5] seem to be getting better all the time, but I don't think they're generally as performant yet as direct use of intrinsics. In the cases where they are, it usually seems to have taken a fair amount of coaxing and prodding.
[1] http://ispc.github.io/
[2] http://www.cs.utah.edu/~aek/research/triangle.pdf
[3] https://github.com/embree/embree
[4] http://visual-computing.intel-research.net/publications/pape...
[5] http://software.intel.com/en-us/blogs/2010/09/03/simd-parall...