Again, inline ASM is pretty rare these days (when we do use it, it isn't for SIMD). Intrinsics are much more common.
The big issue (aside from convincing MSVC to implement it ;) with your suggestion is that, unlike TCO, vectorization isn't really a boolean. There's a range of what vectorization might mean (you can vectorize code and do a bad job with it, only marginally beating out the scalar code), so you'd still need to check the generated code for the situations where you care.
And honestly, it's not worth the effort. Vectorization shouldn't be as scary as it is for most programmers. Once you get the hang of how to do it, it's not bad at all. We write a lot of SIMD code at work, and 'difficulty of writing SIMD code' isn't a big issue for us. Honestly, it's kind of fun, a bit like solving a puzzle (an optimization puzzle, something like SpaceChem or Infinifactory).
Now, a situation where it might be a win is if you have a lot of different platforms you need vectorized code for... but in my experience you're probably better off doing it by hand unless this is a huge number.
Writing SSE code using compiler intrinsics is indeed a fun puzzle, but it has huge drawbacks: 1) it's Intel-specific, and 2) it's a maintenance risk unless everyone in the shop knows how to write and maintain SSE code. Unfortunately nobody else at my job knows how to do it, so I am not allowed to check any in :(
> it's a maintenance risk unless everyone in the shop knows how to write and maintain SSE code.
This is understandable, but not the case where I work. If you need to be writing SIMD code and this is the case, then you need to hire programmers who can do it. That or convince them to learn, as (again) it's not that hard.
I've been thinking about that for the last few days, and I think that's the best solution, if it's possible. Optimizations can clutter up the code and obscure the intent. Writing idiomatic code and hoping that the compiler figures it out is also suboptimal, as noted by the grandparent.
I think the best solution is to be able to put some kind of annotation, or other declaration, on a function that says "this function should be no worse than this". Scala's @tailrec is one example. I'm unfortunately having a hard time coming up with other specific examples, but the gist is that the compiler either manages to do all the work on the function itself and the functions it calls, or errors out and reports what it couldn't do. Ideally I would want to write 10 functions which are all pure and referentially transparent, call them all from a top-level function carrying an annotation that states my demands with regard to optimization, and then have that function transformed into a single, efficient, fused loop with no allocations or unnecessary intermediate values. But like I mentioned, the hard part seems to be actually specifying what your demands are.
In my experience, I always write clear, idiomatic C code alongside my intrinsics-based vectorized function, along with comments on the tricky parts (such as using packus to clamp values to 0-255). You get clear code and also optimized code (you might want the C code anyway for pre-SSE2/MMX and non-x86 targets). The downside is that you have to maintain multiple copies of the code, but if you already have a version optimized for each of SSE2/SSE3/SSE4.1/AVX2, it isn't really much more hassle.
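To make the packus example concrete, here is a rough sketch of what that pairing might look like. The function names (clamp_to_u8_scalar, clamp_to_u8) and the HAVE_SSE2 macro are made up for illustration, and it assumes the input is signed 16-bit values being narrowed to bytes. Treat it as one possible layout, not the only way to do it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature check: use SSE2 only where the compiler says it exists. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Reference version: clamp each signed 16-bit value to 0..255.
 * This is the "clear, idiomatic C" copy, and also the fallback
 * for non-x86 or pre-SSE2 targets. */
void clamp_to_u8_scalar(const int16_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int16_t v = src[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Vectorized version: same result as clamp_to_u8_scalar. */
void clamp_to_u8(const int16_t *src, uint8_t *dst, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    /* Process 16 values per iteration (two registers of 8 x int16). */
    for (; i + 16 <= n; i += 16) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + i + 8));
        /* Tricky part: packus narrows int16 -> uint8 with unsigned
         * saturation, so negatives become 0 and values above 255
         * become 255. The clamp comes for free with the narrowing. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
#endif
    /* Leftover elements (or everything, without SSE2) go through the
     * reference code. */
    clamp_to_u8_scalar(src + i, dst + i, n - i);
}
```

Having the scalar copy around also gives you something to test the vectorized copy against, which is most of how you keep the two from drifting apart.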