| HN Mirror


by abelsson 4636 days ago

I'll grant you that - AVX is pretty uncommon. I originally wrote it with SSE2 (there's a working version in the next-to-last commit on the GitHub repo), but rewrote it using AVX because... well, I hadn't used it before.

But I wouldn't say writing media and signal processing inner loops using SIMD intrinsics is uncommon. The style of optimization I illustrate is pretty common, perhaps minus the AVX code path. Most widely used video/image processing, ray tracing and other compute-bound libraries will probably be SIMD-optimized in some fashion (probably with different code paths for different processors). You gain a 1-8x speedup, which is pretty significant. That's the same order of magnitude as the speedup from threading your program.
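To make "different code paths for different processors" concrete, here is a minimal sketch of runtime dispatch. It is not taken from the repo under discussion: the function names (`sum_scalar`, `sum_avx`, `pick_sum`) and the `HAVE_AVX_PATH` macro are made up for illustration, and it assumes GCC or Clang on x86-64 for the `target` attribute and `__builtin_cpu_supports`. Other toolchains fall through to the scalar path.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1   // hypothetical macro, only used in this sketch
#endif

// Portable fallback: plain scalar sum.
float sum_scalar(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

#ifdef HAVE_AVX_PATH
// AVX path: 8 floats per iteration. The target attribute lets this one
// function use AVX even when the rest of the file is built without -mavx.
__attribute__((target("avx")))
float sum_avx(const float* x, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (float v : lanes) s += v;   // horizontal sum of the 8 lanes
    for (; i < n; ++i) s += x[i];   // leftover elements
    return s;
}
#endif

using SumFn = float (*)(const float*, std::size_t);

// Choose a code path once, based on what the running CPU reports.
SumFn pick_sum() {
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) return sum_avx;
#endif
    return sum_scalar;
}
```

A caller would typically do `SumFn sum = pick_sum();` once at startup and then call `sum(data, n)` in the hot path. Note that the two paths add in a different order, so results can differ in the last bits for general floating-point input.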

I have yet to see anyone truly and systematically trusting automatic vectorization, but perhaps there are libs out there I've missed. Anyone know of some?

3 comments

a_e_k 4635 days ago

Hi there, speaking of autovectorization, I actually tried it on this last night. After seeing kid0man's post, I decided to try optimizing it and selected that same loop as my target. (When I wrote the original C++ program, I was favoring conciseness and portability over performance, naturally.)

I made many of the same transformations as you did: switching the object's data to a structure of arrays, splitting out the computation of the normal from the loop, etc. (even an int hit = -1.) My goal was to coax Intel's compiler into autovectorizing that loop, without directly using vector intrinsics. I succeeded, but the result turned out to be noticeably slower than just compiling kid0man's version with -fast. Part of that, I suspect, is that it generated suboptimal code for the reduction over the minimum, where a human programmer would have used a movemask as you did.
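For readers who haven't seen the pattern, here is a rough sketch of a structure-of-arrays layout combined with a movemask-based nearest-hit test. It is not the code from either version discussed here: `Spheres`, `Hit`, `nearest_scalar`, `nearest_simd`, the 0.01 epsilon and the 1e9 "no hit" distance are all assumptions for illustration. It uses SSE2 intrinsics when `__SSE2__` is defined and otherwise falls back to the scalar version.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

constexpr int kCount = 8;  // multiple of 4, so no tail loop is needed
static_assert(kCount % 4 == 0, "SIMD loop assumes a multiple of 4");

// Structure of arrays: each field is contiguous across objects.
struct Spheres {
    float cx[kCount], cy[kCount], cz[kCount];  // centers
    float r2[kCount];                          // squared radii
};

struct Hit { float t; int index; };  // index == -1 means "no hit"

// Scalar reference. The formula assumes d is a unit vector.
Hit nearest_scalar(const Spheres& s, const float o[3], const float d[3]) {
    assert(std::fabs(d[0] * d[0] + d[1] * d[1] + d[2] * d[2] - 1.0f) < 1e-4f);
    Hit best{1e9f, -1};
    for (int i = 0; i < kCount; ++i) {
        float px = o[0] - s.cx[i], py = o[1] - s.cy[i], pz = o[2] - s.cz[i];
        float b = px * d[0] + py * d[1] + pz * d[2];
        float q = b * b - (px * px + py * py + pz * pz - s.r2[i]);
        if (q > 0.0f) {
            float t = -b - std::sqrt(q);
            if (t > 0.01f && t < best.t) best = Hit{t, i};
        }
    }
    return best;
}

#if defined(__SSE2__)
// Four dot products at once, one per lane.
static inline __m128 dot3(__m128 ax, __m128 ay, __m128 az,
                          __m128 bx, __m128 by, __m128 bz) {
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(ax, bx), _mm_mul_ps(ay, by)),
                      _mm_mul_ps(az, bz));
}
#endif

Hit nearest_simd(const Spheres& s, const float o[3], const float d[3]) {
#if defined(__SSE2__)
    Hit best{1e9f, -1};
    const __m128 zero = _mm_setzero_ps(), eps = _mm_set1_ps(0.01f);
    const __m128 dx = _mm_set1_ps(d[0]), dy = _mm_set1_ps(d[1]), dz = _mm_set1_ps(d[2]);
    for (int i = 0; i < kCount; i += 4) {
        __m128 px = _mm_sub_ps(_mm_set1_ps(o[0]), _mm_loadu_ps(s.cx + i));
        __m128 py = _mm_sub_ps(_mm_set1_ps(o[1]), _mm_loadu_ps(s.cy + i));
        __m128 pz = _mm_sub_ps(_mm_set1_ps(o[2]), _mm_loadu_ps(s.cz + i));
        __m128 b = dot3(px, py, pz, dx, dy, dz);
        __m128 c = _mm_sub_ps(dot3(px, py, pz, px, py, pz), _mm_loadu_ps(s.r2 + i));
        __m128 q = _mm_sub_ps(_mm_mul_ps(b, b), c);
        __m128 t = _mm_sub_ps(_mm_sub_ps(zero, b), _mm_sqrt_ps(q));  // NaN where q < 0
        // Lane passes if it hits, is in front of the ray, and beats the current best.
        __m128 ok = _mm_and_ps(_mm_cmpgt_ps(q, zero),
                    _mm_and_ps(_mm_cmpgt_ps(t, eps),
                               _mm_cmplt_ps(t, _mm_set1_ps(best.t))));
        int mask = _mm_movemask_ps(ok);  // one bit per lane
        if (mask == 0) continue;         // common case: nothing closer in this group
        float lanes[4];
        _mm_storeu_ps(lanes, t);
        for (int k = 0; k < 4; ++k)
            if (((mask >> k) & 1) && lanes[k] < best.t) best = Hit{lanes[k], i + k};
    }
    return best;
#else
    return nearest_scalar(s, o, d);
#endif
}
```

The point of the movemask is that the scalar bookkeeping for `t` and `hit` only runs when at least one of the four lanes actually improves on the current best, which is rare once a near hit has been found.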

That said, I'm fairly curious to see how it would perform with the kernel compiled via ispc [1].

Regarding direct use of SIMD intrinsics in inner loops, I have to agree with you that it's still reasonably common for this type of thing. I've certainly done it before in ray tracing contexts [2], and I've seen many others do it as well, e.g. [3] and [4]. Autovectorization and things like Intel's array notation extension [5] seem to be getting better all the time, but I don't think they're generally as performant yet as direct use of intrinsics. In the cases where they are, it usually seems to have taken a fair amount of coaxing and prodding.

[1] http://ispc.github.io/

[2] http://www.cs.utah.edu/~aek/research/triangle.pdf

[3] https://github.com/embree/embree

[4] http://visual-computing.intel-research.net/publications/pape...

[5] http://software.intel.com/en-us/blogs/2010/09/03/simd-parall...


abelsson 4635 days ago

As an intermediate step, before starting with the SSE intrinsics, I rewrote the code in a form that should have been reasonably suitable for an autovectorizer (an inner loop over a fixed number of elements - I imagine it probably looked fairly similar to your code), but my gcc with -ftree-vectorize didn't do much with it. I didn't really explore that path further though.
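A sketch of what such an "autovectorizer-friendly" form might look like, under assumptions: the names (`kObjects`, `kMiss`, `hit_distances`, `nearest_index`) are invented here, and the actual intermediate version is not shown in this thread. The idea is a fixed trip count, contiguous arrays, and a branch-free body, with the min search split into a separate pass. Whether a given compiler vectorizes the first loop depends on version and flags (with gcc, possibly -O3 plus something like -fno-math-errno for the sqrt), so treat this as a starting point rather than a guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr int kObjects = 32;   // fixed trip count, as in the 32-object example
constexpr float kMiss = 1e9f;  // sentinel distance meaning "no hit"

// Pass 1: per-object hit distance, no branches and no loop-carried state.
// Assumes d is a unit vector.
void hit_distances(const float* cx, const float* cy, const float* cz,
                   const float* r2, const float o[3], const float d[3],
                   float* t_out) {
    for (int i = 0; i < kObjects; ++i) {
        float px = o[0] - cx[i], py = o[1] - cy[i], pz = o[2] - cz[i];
        float b = px * d[0] + py * d[1] + pz * d[2];
        float q = b * b - (px * px + py * py + pz * pz - r2[i]);
        // Clamp so sqrt never sees a negative value; the select below
        // discards those lanes anyway.
        float t = -b - std::sqrt(std::max(q, 0.0f));
        t_out[i] = (q > 0.0f && t > 0.01f) ? t : kMiss;
    }
}

// Pass 2: scalar scan for the nearest hit; returns -1 if everything missed.
int nearest_index(const float* t, float* t_best) {
    assert(t != nullptr && t_best != nullptr);
    int hit = -1;
    float best = kMiss;
    for (int i = 0; i < kObjects; ++i)
        if (t[i] < best) { best = t[i]; hit = i; }
    *t_best = best;
    return hit;
}
```

Splitting the work this way trades an extra pass over 32 floats for a first loop with no dependency between iterations, which is the part an autovectorizer has the best chance of handling.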

I actually did a version which did the reduction over the minimum purely using SIMD, followed by a post step which reduced the SIMD minimums to a single scalar. It was somewhat tricky to get the index right, and in the end it turned out not to be faster (at least not for the little example of 32 objects; I imagine you would gain something on a more complex scene).
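As an illustration of that kind of reduction (not the version described above, which isn't shown in the thread), here is a sketch that keeps a per-lane minimum and per-lane index in SSE2 registers and collapses the four lanes in a post step. The names `argmin_scalar`, `argmin_simd` and `kNoHit` are made up, and the tie-breaking rule (lowest index wins) is an assumption chosen to match the scalar loop.

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

constexpr float kNoHit = 1e9f;  // sentinel: values at or above this never win

// Scalar reference: first index of the smallest value below kNoHit, or -1.
int argmin_scalar(const float* t, int n, float* t_min) {
    int hit = -1;
    float best = kNoHit;
    for (int i = 0; i < n; ++i)
        if (t[i] < best) { best = t[i]; hit = i; }
    *t_min = best;
    return hit;
}

int argmin_simd(const float* t, int n, float* t_min) {
    assert(n % 4 == 0);  // sketch only: no tail handling
#if defined(__SSE2__)
    __m128 best = _mm_set1_ps(kNoHit);        // running minimum per lane
    __m128i best_idx = _mm_set1_epi32(-1);    // index of that minimum per lane
    __m128i idx = _mm_setr_epi32(0, 1, 2, 3); // indices of the current group
    const __m128i four = _mm_set1_epi32(4);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(t + i);
        __m128i lt = _mm_castps_si128(_mm_cmplt_ps(v, best));
        best = _mm_min_ps(v, best);
        // Select idx where v improved, otherwise keep the old index.
        best_idx = _mm_or_si128(_mm_and_si128(lt, idx),
                                _mm_andnot_si128(lt, best_idx));
        idx = _mm_add_epi32(idx, four);
    }
    // Post step: reduce the four lane results to one scalar answer.
    float bt[4];
    int bi[4];
    _mm_storeu_ps(bt, best);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(bi), best_idx);
    int hit = -1;
    float m = kNoHit;
    for (int k = 0; k < 4; ++k) {
        if (bi[k] < 0) continue;  // this lane never saw a hit
        // On equal distances prefer the lower index, like the scalar loop.
        if (hit < 0 || bt[k] < m || (bt[k] == m && bi[k] < hit)) {
            m = bt[k];
            hit = bi[k];
        }
    }
    *t_min = m;
    return hit;
#else
    return argmin_scalar(t, n, t_min);
#endif
}
```

The tie-break in the post step is where the index handling gets fiddly: each lane only knows its own earliest minimum, so two lanes can hold the same distance with different indices.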

Anyway, it was a fun little exercise and it has sparked some interesting discussion. Thanks for posting the original.


corresation 4636 days ago

> But I wouldn't say writing media and signal processing inner loops using SIMD intrinsics is uncommon.

But at that point this has nothing to do with Go or C++, and I find this whole discussion rather disingenuous (at first I thought you were detailing the maturity of C(++) compilers and their superior support of auto-vectorization, which would be a reasonable angle): You can import the Intel math libraries and call them from Go (I know, as I do it regularly. See my submissions).


abelsson 4636 days ago

No, it doesn't. I tried to make that point too, but perhaps it didn't come across very clearly. For these kinds of tight inner loops, a language only ever gets in the way; the difference is really only how difficult it is to get rid of the conveniences you don't want. (The much-vaunted zero cost of features you don't use, in C++ lingo, I guess.)

I still think that a systems programming language needs to offer escape hatches, whilst striving towards ease of use in the common case. C++ has plenty of hatches, at the cost of horrific complexity.

But suppose I'm willing to pay the cost of writing my code in 5 different code paths for different processors for that extra 2-4x of performance. Very few languages offer that possibility, and most of those that do only offer to call a C library. I'm the guy stuck writing the Intel math libraries of the world, and I want something more reasonable to do it in.


cmccabe 4636 days ago

The point that corresation is making is that none of the optimizations you did had anything to do with C++. You could have easily done them for the Go version, but you didn't. Then you put up a chart and said "this is why C++ is better." Huh?


pjmlp 4635 days ago

Funny, where in the ANSI C++ standard is the entry for AVX/SSE2?
