by pcwalton 4244 days ago
Autovectorization has been an area of intense compiler effort for a decade or more and by and large the primary customers of it (games, video codecs, etc.) prefer the intrinsics. It's perceived as too unreliable and brittle to be relied upon, and it's easy to see why: given a choice between having to think about what the compiler's alias analysis, overflow analysis, loop trip count analysis, etc. will do and just writing an intrinsic and calling it a day, programmers will choose the latter.

This applies regardless of how good the autovectorization really is: it's in a weird catch-22 kind of space where adding more and more features to your autovectorizer can actually reduce its perceived reliability, by making the answer to "will this vectorize?" harder and harder for a programmer to answer at a glance. <xmmintrin.h> has a lot of problems, but it's reliable, and at the end of the day that's what history has shown that game devs and video codec authors want.
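
To make the "will this vectorize?" question concrete, here is a minimal C sketch of the aliasing part of it. The function names are made up for illustration, and whether either loop is actually vectorized depends on the compiler, its version, and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* dst and src may overlap here, so before using SIMD the compiler must
   either prove they do not or emit a runtime overlap check. */
void scale_may_alias(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* restrict is the programmer's promise that dst and src do not overlap;
   breaking that promise is undefined behavior. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n) {
    /* Cheap debug check; it only catches exact aliasing, not partial overlap. */
    assert(n == 0 || dst != src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

Both functions compute the same result on non-overlapping buffers. The only difference is how much the programmer has to know about the compiler's alias analysis to predict the generated code.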

6 comments

"Autovectorization has been an area of intense compiler effort for a decade or more and by and large the primary customers of it (games, video codecs, etc.) prefer the intrinsics. It's perceived as too unreliable and brittle to be relied upon, and it's easy to see why: given a choice between having to think about what the compiler's alias analysis, overflow analysis, loop trip count analysis, etc. will do and just writing an intrinsic and calling it a day, programmers will choose the latter."

100% true for C++ (though it would be more accurate to say "4 decades" if you want to count Fortran autovectorization, which has been going on since the late '70s).

But, I'll point out, plenty of the time they end up writing intrinsics that are slower than what the compiler's autovectorization produced for the same code.

(Plenty of the time they don't, too).

Additionally, all of the problems you mentioned are due to specific issues in C/C++. In other languages, autovectorization is not just "relied upon", it's basically "part of the standard" (see, e.g., Fortran 95).

Given that all of the brittleness you talk about is precisely because of the lack of pointer safety, alignment issues, and all sorts of things that simply only exist in C/C++, where programmers have a lot of control, I'm not sure it makes sense to base your argument on the experience of a language that is very different from the one this API was designed for.

All that said, truthfully, IMHO, neither autovectorization nor intrinsics at the level you are talking about make for a good programming model in most languages.

The intrinsics at this level don't get used effectively: among other reasons, they codegen differently on different platforms that don't directly have the exact same SIMD semantics, which is "all of them" :P

I know you guys are trying to avoid this by limiting the ops available/etc. It is, IMHO, a losing game.

So you end up with the same problem: People write loops that are really bad on some platforms, and good on others.

Autovectorization knows what the target looks like, but doesn't trigger in some cases people want it to.

In the end, I think doing things like Halide is a lot more useful as a programming model than simd.js.

simd.js is a usable implementation mechanism for some of those programming models, but I would not sell it as the programming model itself.

In fact, almost the exact set of intrinsics mentioned in simd.js were allowed for generic operations on vectors in GCC (you can create a vector 32x4 float in a platform independent way, do normal ops on it, and it will codegen down to lower level vector ops, without ever seeing xmmintrin). It was simultaneously not high level and not low level enough.
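
For anyone who hasn't seen the GCC generic vectors mentioned here, a minimal sketch follows. It assumes GCC or Clang (the `vector_size` attribute is a compiler extension, not standard C), and `float4` and `saxpy4` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GCC/Clang extension: four floats in one 16-byte vector, no platform header. */
typedef float float4 __attribute__((vector_size(16)));

/* y[i] = a * x[i] + y[i]; in this sketch n must be a multiple of 4. */
void saxpy4(float a, const float *x, float *y, size_t n) {
    assert(n % 4 == 0);
    float4 va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        float4 vx, vy;
        memcpy(&vx, x + i, sizeof vx);   /* load without assuming alignment */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;               /* ordinary operators, applied element-wise */
        memcpy(y + i, &vy, sizeof vy);   /* store the four results */
    }
}
```

The compiler lowers the `float4` arithmetic to whatever vector instructions the target has, or to scalar code if it has none, which is the "not high level and not low level enough" middle ground described above.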

People resorted to the lower-level platform-specific intrinsics to get better performance, or wrote higher-level libraries to get better abstraction.

In any case, I'm sure it's faster than what you have now, and certainly an advance. I'd just be careful of thinking it's going to work all that well except for targeted use cases.

Yes, we don't expect everyone will want to program to the bare SIMD.js API directly for everything; it's also intended to provide basic functionality that higher-level libraries and even specialized languages, like Halide, can be built on (when they aren't running on ARB_compute_shader).

> (games, video codecs, etc.) prefer the intrinsics

Nit: Most of the projects I'm familiar with (libav/ffmpeg, x264, etc) prefer to break out the SIMD into hand-written functions, instead of relying on intrinsics or even inline asm. This avoids problems with register allocation and code gen, consistency/portability between compilers, etc.

Otherwise, yes, autovectorization is hard, both for application developers and compiler writers. Application code needs to be structured in a very precise way, and the correctness of C -> SIMD transformations needs to be proven. Intrinsics and hand-written SIMD aren't going away.

> This avoids problems with register allocation and code gen, consistency/portability between compilers, etc.

Yup. Which just goes to show: reliability is king.

I appreciate that autovectorization is hard to do well. However, I think the world of JS is different from the world of C/C++. JS optimization is already pretty unpredictable, since the language is dynamic (both in terms of typing and e.g. object memory layout, with the exception of typed arrays); JS primitives are farther from the metal; the optimizations that JS engines perform are implementation-specific, rarely well-documented and always in flux; and it's difficult to see what machine code actually runs for a given JS function. SIMD instructions may make some sense for JS as a compiler target, but they seem to make less sense for JS as a language that doesn't have integers or 32-bit floating point numbers. On top of that, most users of vectorization are targeting a specific architecture or even CPU, whereas JS code is meant to run anywhere. It doesn't seem like there's been much work to alter the language or tools to make it easier for programmers to reason about other sources of unpredictability, so why so much emphasis on SIMD?

I'll admit some ignorance here, but it also seems to me that a JIT may also have some advantages WRT autovectorization as compared with a static compiler, since you can collect runtime information about aliasing and loop trip count before choosing to vectorize. But if the point is to make performance easier to reason about, why not start with the rest of the language before worrying about vectorization?
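
As a rough illustration of what "collect runtime information about aliasing and loop trip count" could look like, here is a C sketch of a guarded loop. All names are hypothetical, the threshold of 4 is arbitrary, and the overlap test assumes a flat address space where pointers can be compared as integers. It is not how any particular JIT does it; static compilers emit similar guards when they version a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum path { PATH_SCALAR, PATH_VECTOR };

/* True if the n-float ranges starting at a and b share any byte.
   Assumption: flat address space, so integer comparison is meaningful. */
static int ranges_overlap(const float *a, const float *b, size_t n) {
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return pa < pb + bytes && pb < pa + bytes;
}

/* Fast path: restrict is justified because the caller checked for overlap,
   so the compiler is free to vectorize this loop. */
static void add_one_no_alias(float *restrict dst, const float *restrict src,
                             size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}

/* Picks a path from facts known only at run time and reports which one ran. */
enum path add_one_guarded(float *dst, const float *src, size_t n) {
    assert(dst != NULL && src != NULL);
    if (n >= 4 && !ranges_overlap(dst, src, n)) {
        add_one_no_alias(dst, src, n);
        return PATH_VECTOR;
    }
    for (size_t i = 0; i < n; i++)   /* in-order scalar fallback */
        dst[i] = src[i] + 1.0f;
    return PATH_SCALAR;
}
```

The guard makes the fast path safe, but it does not make it any easier for the programmer to predict which path a given call will take.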

> It doesn't seem like there's been much work to alter the language or tools to make it easier for programmers to reason about other sources of unpredictability, so why so much emphasis on SIMD?

But there certainly have been such efforts! Standards bodies have added features like Typed Arrays, Math.fround, etc., and work is ongoing on Classes, Typed Objects, and Modules. All of those things make performance more predictable.

There are also better devtools all the time, which help you understand performance issues better.

And there is also asm.js which aims to make a certain type of JavaScript extremely predictable.

A final point - the unpredictability you mention is exactly why a SIMD API is needed. JavaScript is more unpredictable than C and C#, but even those have added SIMD APIs, because even in their predictable worlds, autovectorization wasn't good enough.

The key is that Mozilla is betting hard on Emscripten/asm.js building marketshare/mindshare into the future, and being perceived as being exactly as performant and reliable as C/C++ running in a native process. SIMD.js should of course be able to run from non-asm.js code... but that's more of a bonus (since that's an almost-strict subset of the work required to get it to work on asm.js code, AFAIK).

> video codecs, etc.) prefer the intrinsics.

Prefer assembly. Intrinsics usually make a disaster of register allocation and you lose much of your performance to needless loads/stores.

Well, if it's between autovectorization and intrinsics...

Lately I've been rather disappointed in how minimal the gains are in reducing register spills from intrinsics on modern CPUs, with their wide decode/issue, 16 registers, and dual load pipelines - by the time a loop is complex enough that a compiler spills, extra load/store uops are almost free from a micro benchmark perspective. The macro gains from smaller code and reduced cache usage are a bit bigger, but still depressingly minor for the effort expended.

But if you care about 32-bit x86 that's another story of course.

So, one of the real reasons reducing register spills does not help is not related to what you suggest: on modern x86, they play games with what looks like "memory" to you, so you aren't actually spilling into "memory" anyway :)

You (and a lot of people) make it sound like it's magic but it's not - http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation...

It's not magic. But it's not what that blog post is talking about.

On some of these processors, 128 bytes of stack or so is not really "memory" (in the sense of being stored with memory), so spilling is not that bad.

That's the magic I'm talking about because it's not true; memory is memory and stack memory isn't treated specially by the processor. What it does have is a store buffer, which applies to all memory accesses and is what store forwarding uses to bypass L1.

What about meeting the compiler in the middle? I like the MATLAB/NumPy/BLAS approach: ask the developer for the high-level vector operation (e.g. vector addition, inner product, matrix multiplication...), and then have the library/runtime turn that into SIMD instructions.
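
A tiny C sketch of that style of interface, with invented names (`vec_add`, `vec_dot`) and plain reference loops standing in for the tuned kernels. The assumption is that a real library would keep these signatures and swap in SIMD code per target behind them.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = a[i] + b[i]. The caller states the whole-vector operation;
   how it is executed is the library's business. out must not alias a or b. */
void vec_add(float *restrict out, const float *a, const float *b, size_t n) {
    assert(n == 0 || (out != a && out != b));
    for (size_t i = 0; i < n; i++)   /* reference loop; a SIMD kernel could go here */
        out[i] = a[i] + b[i];
}

/* Inner product of a and b, accumulated in single precision. */
float vec_dot(const float *a, const float *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The caller never writes a loop, so there is nothing for an autovectorizer to misjudge at the call site; the trade-off is that only the operations the library chose to expose are fast.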

> "will this vectorize?"

Yes, but _will it blend?_