
by a_e_k 4635 days ago

Hi there, speaking of autovectorization, I actually tried it on this last night. After seeing kid0man's post, I decided to try optimizing it and selected that same loop as my target. (When I wrote the original C++ program, I was favoring conciseness and portability over performance, naturally.)

I made many of the same transformations as you did: switching the objects' data to a structure of arrays, splitting the computation of the normal out of the loop, etc. (even an int hit = -1). My goal was to coax Intel's compiler into autovectorizing that loop without directly using vector intrinsics. I succeeded, but the result turned out to be noticeably slower than just compiling kid0man's version with -fast. Part of that, I suspect, is that it generated suboptimal code for the reduction over the minimum, where a human programmer would have used a movemask as you did.
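For readers who have not seen this kind of restructuring, here is a minimal sketch of what such a loop might look like. It is not the code from either program: the `Spheres`, `Hit`, and `nearest_hit` names are hypothetical, the scene is assumed to consist of spheres, the ray direction is assumed to be unit length, and the 0.01 self-intersection cutoff is an arbitrary choice. The normal is deliberately left out of the loop, and a miss is reported as index -1. Whether a given compiler actually vectorizes it depends on the compiler, its version, and its flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical structure-of-arrays layout: one array per sphere field,
// so consecutive iterations read consecutive floats.
struct Spheres {
    std::vector<float> cx, cy, cz;  // sphere centers
    std::vector<float> r2;          // squared radii
};

struct Hit {
    int index;  // -1 means the ray missed every sphere
    float t;    // distance along the ray to the nearest hit
};

// Ray origin (ox, oy, oz); direction (dx, dy, dz) is assumed to be unit length.
inline Hit nearest_hit(const Spheres& s,
                       float ox, float oy, float oz,
                       float dx, float dy, float dz) {
    const std::size_t n = s.cx.size();
    assert(s.cy.size() == n && s.cz.size() == n && s.r2.size() == n);

    Hit best{-1, std::numeric_limits<float>::infinity()};
    for (std::size_t i = 0; i < n; ++i) {
        // Vector from the sphere center to the ray origin.
        const float px = ox - s.cx[i];
        const float py = oy - s.cy[i];
        const float pz = oz - s.cz[i];
        // Solve t^2 + 2*b*t + c = 0 for the nearer root.
        const float b = px * dx + py * dy + pz * dz;
        const float c = px * px + py * py + pz * pz - s.r2[i];
        const float disc = b * b - c;
        const float t = -b - std::sqrt(disc > 0.0f ? disc : 0.0f);
        // Branch-free selects instead of if/else, which is the form
        // vectorizers tend to handle best.
        const bool ok = disc > 0.0f && t > 0.01f && t < best.t;
        best.t = ok ? t : best.t;
        best.index = ok ? static_cast<int>(i) : best.index;
    }
    return best;  // the surface normal is computed afterwards, only for best.index
}
```

The running minimum together with its index is the reduction mentioned above, and it is the part of the loop a compiler is most likely to handle poorly.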

That said, I'm curious to see how it would perform with the kernel compiled via ispc [1].

Regarding direct use of SIMD intrinsics in inner loops, I have to agree with you that it's still reasonably common for this type of thing. I've certainly done it before in ray tracing contexts [2], and I've seen many others do it as well, e.g. [3] and [4]. Autovectorization and things like Intel's array notation extension [5] seem to be getting better all the time, but I don't think they're generally as performant yet as direct use of intrinsics. In the cases where they are, it usually seems to have taken a fair amount of coaxing and prodding.

[1] http://ispc.github.io/

[2] http://www.cs.utah.edu/~aek/research/triangle.pdf

[3] https://github.com/embree/embree

[4] http://visual-computing.intel-research.net/publications/pape...

[5] http://software.intel.com/en-us/blogs/2010/09/03/simd-parall...

1 comment

abelsson 4635 days ago

As an intermediate step, before starting with the SSE intrinsics, I rewrote the code in a form that should have been reasonably suitable for an autovectorizer (an inner loop over a fixed number of elements; I imagine it looked fairly similar to your code), but my gcc with -ftree-vectorize didn't do much with it. I didn't really explore that path further, though.

I actually did a version which did the reduction over the minimum purely in SIMD, followed by a post step which reduced the SIMD minimums to a single scalar. It was somewhat tricky to get the index right, and in the end it turned out not to be faster (at least not for the little example of 32 objects; I imagine you would gain something on a more complex scene).
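A minimal sketch of that kind of reduction, for illustration only and not the code from the post: each of the four SSE lanes keeps its own running minimum and the index it came from, and a scalar post step then picks the best lane. The `MinResult` and `simd_min_index` names are hypothetical, and the sketch assumes an x86 target with SSE2, an input of precomputed hit distances, and a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <limits>

struct MinResult {
    int index;    // -1 if no element was below +infinity
    float value;  // smallest value found
};

// Finds the smallest value in t[0..n) and its index; n must be a multiple of 4.
inline MinResult simd_min_index(const float* t, std::size_t n) {
    assert(n % 4 == 0);
    const float inf = std::numeric_limits<float>::infinity();

    __m128 best_t = _mm_set1_ps(inf);          // per-lane running minimum
    __m128i best_i = _mm_set1_epi32(-1);       // per-lane index of that minimum
    __m128i idx = _mm_setr_epi32(0, 1, 2, 3);  // indices of the current group
    const __m128i step = _mm_set1_epi32(4);

    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 v = _mm_loadu_ps(t + i);
        // All-ones in each lane where the new value beats the lane's minimum.
        const __m128i lt = _mm_castps_si128(_mm_cmplt_ps(v, best_t));
        best_t = _mm_min_ps(v, best_t);
        // SSE2 has no blend instruction, so select with and/andnot/or.
        best_i = _mm_or_si128(_mm_and_si128(lt, idx),
                              _mm_andnot_si128(lt, best_i));
        idx = _mm_add_epi32(idx, step);
    }

    // Post step: reduce the four lane minimums to a single scalar result.
    alignas(16) float lane_t[4];
    alignas(16) int lane_i[4];
    _mm_store_ps(lane_t, best_t);
    _mm_store_si128(reinterpret_cast<__m128i*>(lane_i), best_i);

    MinResult best{-1, inf};
    for (int k = 0; k < 4; ++k) {
        if (lane_i[k] < 0) continue;  // this lane never saw a value below +infinity
        const bool closer = lane_t[k] < best.value;
        // On a tie between lanes, prefer the lower index to match a scalar loop.
        const bool tie_lower = lane_t[k] == best.value && lane_i[k] < best.index;
        if (closer || tie_lower) best = MinResult{lane_i[k], lane_t[k]};
    }
    return best;
}
```

The index bookkeeping is where the fiddliness shows up: the strict less-than keeps the earliest index within a lane, but ties across lanes need the explicit check in the post step.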

Anyway, it was a fun little exercise and it has sparked some interesting discussion. Thanks for posting the original.
