This is responsible for Ryzen losing to Intel in SIMD heavy benchmarks. The upside is that it avoids the reduced turbo boost Intel does for some 256-bit AVX instructions (and even worse downclocks caused by AVX 512), so for workloads mixing avx and normal instructions it shouldn't do too badly.
Is there a good resource that explains the difference between those ("decode" vs. "trapping") on a modern CPU? When I see "trap" I imagine the kernel catching illegal instruction exceptions and emulating them in software, but it doesn't seem like that's what you mean?
Modern CPUs execute micro ops, RISC-like instructions (e.g. load from memory address to register, add two registers, store from register to memory address). The CPU's "decode" stage translates x86 instructions into micro-ops (often 1-to-1, but x86 compare followed by x86 jump are translated into a single micro-op, while some x86 instructions are translated into multiple micro-ops).
On one CPU model a x86 operation like "256bit add" might translate into "256bit add" micro-op, and on another model the same x86 operation might be translated into a series of micro-ops like "128bit add, wait a cycle for the 1st add to finish, pass the carry bit into a 2nd 128 bit add", because that model doesn't have a real 256bit adder. So the latency of the operation is 2 cycles, but nothing else is changed.
Some x86 instructions might be very complicated and cannot be translated into a fixed-length series of micro-ops using a template. For example, the integer division, square root or the string compare machine instructions might be loops with conditionals in them and don't run the same amount of micro-ops every time. They can be implemented by Intel using a program written in micro-ops. Intel stores this program in flash on the CPU and the decoder knows to run that program when encountering the instruction. The OS doesn't need to help here, this is not emulation or software-floating point, it's just that the single instruction takes 200 clock cycles. What this does to the out-of-order engine is another story. These "programs", called microcode, can have bugs and newer versions of microcode updates, sent to the CPU at boot by the BIOS/UEFI and/or by the OS, update them.
https://www.agner.org/optimize/blog/read.php?i=838#838
This is responsible for Ryzen losing to Intel in SIMD heavy benchmarks. The upside is that it avoids the reduced turbo boost Intel does for some 256-bit AVX instructions (and even worse downclocks caused by AVX 512), so for workloads mixing avx and normal instructions it shouldn't do too badly.