It could also be that x86 has better SIMD support by a considerable margin, which can make repetitive memory access/serialized workloads a lot faster. I'm not super well acquainted with those extensions, but I know that the NEON SIMD implementation in ARMv8 leaves quite a bit to be desired. It's a tricky situation, and one I don't see resolving in a nice clean way. It's stuff like this that makes me hopeful for RISC-V though, where we could theoretically have our cake and eat it too, with dynamic instruction pipelines and incredibly low power usage. Only time will tell, I suppose.
Throughput is not an issue on the M1, with 4x 128-bit SIMD units.
NEON is certainly not a bad SIMD ISA; it's quite an orthogonal one.
You also have the AMX extension at hand, which is more special-purpose but can deliver very high throughput (on a regular M1: 350 GFLOPS DGEMM, 1.2 TFLOPS SGEMM, without leveraging anything other than the CPU).
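To put the "4x 128-bit SIMD units" figure in perspective, here is a rough back-of-the-envelope sketch of the theoretical per-core NEON peak. It is not a measurement. The 3.2 GHz clock, the assumption that every unit can issue one fused multiply-add per cycle, and the convention of counting an FMA as 2 flops are all my assumptions, and `peak_gflops` is just a hypothetical helper name.

```python
def peak_gflops(simd_units, vector_bits, element_bits, clock_ghz, flops_per_lane=2):
    """Theoretical peak GFLOPS for one core.

    Assumes every SIMD unit issues one instruction per cycle and that the
    instruction is an FMA, counted as 2 flops per lane (an assumption).
    """
    lanes = vector_bits // element_bits  # elements per vector register
    return simd_units * lanes * flops_per_lane * clock_ghz


# Assumed inputs: 4 units, 128-bit vectors, 3.2 GHz performance-core clock.
fp32_peak = peak_gflops(4, 128, 32, 3.2)  # 4 lanes per vector
fp64_peak = peak_gflops(4, 128, 64, 3.2)  # 2 lanes per vector

print(f"fp32 per-core peak: {fp32_peak:.1f} GFLOPS")
print(f"fp64 per-core peak: {fp64_peak:.1f} GFLOPS")
```

Real code rarely hits these ceilings, since memory bandwidth and dependency chains get in the way, but it gives a sense of scale for the plain NEON path next to the AMX numbers quoted above.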