But doesn't the Apple M series NPU support FP8, and as it's a monolithic die (except for the GPU in the M5 Pro and Max) it could be argued it has hardware FP8 support, no?
By that logic, on the M4 (which still has the GPU on the same die as the CPU), the CPU cores have hardware-accelerated ray tracing, which is obviously nonsense.
Apple's hardware does not support FP8 (neither the ANE NPU nor the new "neural accelerator" tensor cores), though the most recent variant supports INT8.
If the M5 has 9-18 CPU cores and draws ~20 W, then that's ~1-2 W per CPU core. If these are 200-300 W parts with ~100-200 CPU cores, then guess what? That's also ~1-2 W per CPU core.
Xeons, Epycs, whatever this is: they are all typically optimized for power efficiency too. That's how they can fit so many CPU cores into 200-300 W.
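For anyone who wants to check the per-core arithmetic, here is a rough Python sketch. It only uses the ballpark figures quoted above (they are not measurements), and it assumes the whole package power is spent on CPU cores, which overstates the per-core number since uncore, memory controllers, and any GPU also draw power. The function name `watts_per_core` is just a helper made up for this illustration.

```python
def watts_per_core(power_w, cores):
    """Return (low, high) bounds on watts per core.

    power_w and cores are (min, max) tuples. The low bound pairs the
    smallest power with the most cores; the high bound pairs the largest
    power with the fewest cores, so this is the widest possible range.
    """
    p_lo, p_hi = power_w
    c_lo, c_hi = cores
    return p_lo / c_hi, p_hi / c_lo


# Ballpark figures from the comment above, not measured values.
m5 = watts_per_core((20, 20), (9, 18))
server = watts_per_core((200, 300), (100, 200))

print(f"M5:     {m5[0]:.1f} to {m5[1]:.1f} W per core")
print(f"Server: {server[0]:.1f} to {server[1]:.1f} W per core")
```

These are the widest bounds, so the server range tops out at 3 W per core (300 W across only 100 cores). If power scales with core count, i.e. 100 cores at 200 W and 200 cores at 300 W, it lands at 1.5-2 W per core, the same ballpark as the M5.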