Regardless of the patent issue, emulating AVX on the M1 (which only has 4-wide SIMD) would actually be significantly slower than forcing the x86 application to use its SSE fallback path and emulating that.
Emulating AVX via Rosetta should be just as fast as recompiling the original without AVX support and then emulating that. Emulating wider SIMD instructions is easy: you just use multiple narrower SIMD instructions.
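To make the "multiple narrower instructions" idea concrete, here is a minimal sketch in Python. It is only a conceptual model, not how Rosetta actually translates anything: `add4` is a hypothetical stand-in for one 4-lane hardware add, and `add8_emulated` shows an 8-lane add being split into two of them.

```python
from typing import List

NARROW_LANES = 4  # assumption: one 128-bit register holding four 32-bit values


def add4(a: List[float], b: List[float]) -> List[float]:
    """Hypothetical stand-in for a single 4-lane SIMD add."""
    assert len(a) == NARROW_LANES and len(b) == NARROW_LANES
    return [x + y for x, y in zip(a, b)]


def add8_emulated(a: List[float], b: List[float]) -> List[float]:
    """Emulate one 8-lane add as two independent 4-lane adds."""
    assert len(a) == 2 * NARROW_LANES and len(b) == 2 * NARROW_LANES
    low = add4(a[:NARROW_LANES], b[:NARROW_LANES])    # lanes 0..3
    high = add4(a[NARROW_LANES:], b[NARROW_LANES:])   # lanes 4..7
    return low + high  # concatenate the two halves back into 8 lanes
```

A lane-wise add splits cleanly like this because no lane depends on another. Instructions that move data across the 128-bit halves would need more than two narrow operations, which this sketch does not model.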
On the other hand, disabling AVX for all Intel machines would make those programs significantly slower, so it's clear why there is reluctance to do that...
No. For many algorithms, AVX isn't a 2x speedup over SSE. Especially when lanes are conditionally masked.
Often you are happy to get a 1.25x speedup with AVX. Sometimes it actually runs slower.
If you took code that gets a 1.25x speedup from AVX and emulated it on the M1, you would end up with all the disadvantages of going 8-wide, but with none of the speedup.
That 1.25x speedup is halved and the emulated AVX code actually runs at about 0.625x the speed of the emulated SSE code path.
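The arithmetic behind that 0.625x figure, as a small sketch. It assumes the simplified cost model used in this comment: every 8-wide instruction costs two 4-wide ones on the M1, and the emulated SSE path is the 1.0x baseline. The function name and `split_factor` parameter are made up for illustration.

```python
def emulated_avx_relative_speed(native_avx_speedup: float,
                                split_factor: int = 2) -> float:
    """Speed of the emulated AVX path relative to the emulated SSE path.

    native_avx_speedup: AVX-over-SSE speedup on real x86 hardware.
    split_factor: narrow ops needed per wide op (assumed to be 2 here).
    """
    return native_avx_speedup / split_factor


# The example from the comment: 1.25x natively becomes 0.625x when emulated.
print(emulated_avx_relative_speed(1.25))
# Under this model, break-even needs a full 2x native speedup.
print(emulated_avx_relative_speed(2.0))
```

Under this model the emulated AVX path only matches the SSE path when AVX was a full 2x win natively, which is the best case the parent comment is describing.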
Plus, doesn't the M1 have specialized hardware? What's the neuro-engine, or whatever it is that they call it, for speeding up ML? I imagine at its core it's a bunch of instructions for doing vector operations.
Sidebar: is there documentation for the instruction set or ABI for that hardware?
The Apple Neural Engine is separate from the CPU; it's not additional registers and instructions for the CPU, like a vector unit. You go through the Core ML framework to use it, just like you go through Metal or OpenGL to use the GPU.
The value of SIMD on a CPU these days is really a middle ground where you value latency above throughput, so you would probably face the same trade-off as when moving data to and from a GPU.
That definitely changes the calculus, but as I've mentioned in a different comment, there doesn't seem to be any microarchitectural documentation to read at all, so I (who don't own an M1) have nothing to go on, unfortunately.
I'll make a wild guess that getting data to the neural engine is still probably not quick, because I assume it's some kind of statically scheduled affair (exposed pipeline?). We seem to know almost nothing about it, sadly.