The latest Intel architecture (Sapphire Rapids) supports it without downclocking. AMD Zen 4 also supports it, although its implementation is double pumped; I'm not sure what the real-world performance impact of that is.
There is a lot of confusion about this "double pumped" thing.
All that this means is that Zen 4 uses the same execution units both for 256-bit operations and for 512-bit operations. This means that the throughput in instructions per cycle for 512-bit operations is half of that for 256-bit operations, but the throughput in bytes per cycle is the same.
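To make the arithmetic concrete, here is a small sketch. The function name is made up for illustration, and the issue rates (four 256-bit or two 512-bit register-register instructions per cycle) are simply the figures implied elsewhere in this comment for most instructions, not measurements.

```python
def bytes_per_cycle(width_bits: int, instr_per_cycle: float) -> float:
    """Vector bytes processed per cycle at a given width and issue rate."""
    return width_bits / 8 * instr_per_cycle


# Assumed rates for Zen 4, taken from the figures in this comment:
# four 256-bit or two 512-bit register-register instructions per cycle.
zen4_256 = bytes_per_cycle(256, 4)
zen4_512 = bytes_per_cycle(512, 2)

# Half the instructions per cycle, same bytes per cycle.
print(zen4_256, zen4_512)
```

Both calls give the same number of bytes per cycle, which is the whole point: "double pumped" halves the instruction rate, not the data rate.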
However, the 512-bit operations need fewer resources for instruction fetching and decoding and for micro-operation storing and dispatching, because the same amount of data is covered by half as many instructions. So in most cases using 512-bit instructions on Zen 4 provides a big speed-up.
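A rough way to see the front-end saving is to count how many vector instructions it takes to cover a buffer. This is only a sketch: the helper name and the 4096-byte buffer are arbitrary choices for illustration, and it ignores loop overhead and everything else a real kernel does.

```python
def instructions_needed(buffer_bytes: int, width_bits: int) -> int:
    """Number of vector instructions needed to touch every byte once."""
    vector_bytes = width_bits // 8
    # Ceiling division, so a partial final vector still counts.
    return -(-buffer_bytes // vector_bytes)


buffer_bytes = 4096  # arbitrary example size
n256 = instructions_needed(buffer_bytes, 256)
n512 = instructions_needed(buffer_bytes, 512)

# The 512-bit version has half as many instructions to fetch,
# decode, store and dispatch for the same data.
print(n256, n512)
```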
Even if Zen 4 is "double pumped", its 256-bit throughput is higher than that of Sapphire Rapids, so after dividing by two, for most instructions it has exactly the same 512-bit throughput as Sapphire Rapids, i.e. two 512-bit register-register instructions per cycle.
The only exceptions are that Sapphire Rapids (except for the cheap SKUs) can do 2 FMA instructions per cycle, while Zen 4 can do only 1 FMA + 1 FADD instruction per cycle, and that Sapphire Rapids has double the throughput for loads and stores from the L1 cache memory. There are also a few 512-bit instructions where Zen 4 has better throughput or latency than Sapphire Rapids, e.g. some of the shuffles.
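As a back-of-the-envelope sketch of what the FMA difference means for peak floating-point throughput per core: the function below is a made-up helper, it counts an FMA as 2 floating-point operations per lane and an FADD as 1, and it assumes a perfect instruction mix, so these are theoretical ceilings rather than measured numbers.

```python
def peak_flops_per_cycle(lanes: int, fma_per_cycle: int, fadd_per_cycle: int = 0) -> int:
    """Theoretical FLOPs per cycle: an FMA is 2 FLOPs per lane, an FADD is 1."""
    return lanes * (2 * fma_per_cycle + fadd_per_cycle)


FP32_LANES = 512 // 32  # 16 single-precision lanes per 512-bit register

# Sapphire Rapids (non-cheap SKUs): 2 x 512-bit FMA per cycle.
spr = peak_flops_per_cycle(FP32_LANES, fma_per_cycle=2)

# Zen 4: 1 x 512-bit FMA + 1 x 512-bit FADD per cycle.
zen4_mixed = peak_flops_per_cycle(FP32_LANES, fma_per_cycle=1, fadd_per_cycle=1)

# Zen 4 on code that is purely FMA (no separate adds to pair with).
zen4_fma_only = peak_flops_per_cycle(FP32_LANES, fma_per_cycle=1)

print(spr, zen4_mixed, zen4_fma_only)
```

So on FMA-bound code such as dense matrix multiplication, Sapphire Rapids has up to twice the per-cycle ceiling, while code with a mix of multiplies and adds narrows the gap.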