Yes, I'm aware. The service in question couldn't easily be moved, so we moved to m6i, which isn't ARM-based but does use Nitro. We saw substantial improvements in that configuration too. I'm not sure what is different, since you said m5 uses Nitro as well, but my assumption was that m6i's reduced hypervisor overhead from Nitro was why we saw the improvement.
m6i uses a much newer CPU architecture, based on Intel Ice Lake rather than Skylake. It is significantly faster from that alone. In addition, the CPU has about 10% higher clock speed.
The 16xlarge version is also a single-socket CPU with 32 cores, meaning there should be no issues with NUMA. I would expect it to be much better than m5.24xlarge in most applications once you take the much faster single-threaded performance into account. Of course, nothing beats benchmarking and measuring yourself.
I have personally seen issues with NUMA systems and code that theoretically parallelizes very well. Any synchronized mutable state becomes an issue on these kinds of systems. For example, I have had an issue where third-party code would use the C "rand" function for randomness. Even though this was not used in a hot code path, on m5.24xlarge >90% of the execution time would be spent just on the lock guarding the internal random state. On a "normal" system with fewer cores this never showed up while profiling.
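To make the rand example concrete, here is a minimal C sketch of the two patterns, not the actual third-party code. It assumes a POSIX system with pthreads and a libc such as glibc, where rand() guards its hidden global state with a lock. The names worker_ctx, xorshift32, worker_shared, worker_local and run_workers, as well as the thread and draw counts, are made up for illustration, and xorshift32 is only a stand-in for any generator that keeps its state per thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_THREADS 8            /* hypothetical thread count */
#define DRAWS_PER_THREAD 200000  /* hypothetical amount of work per thread */

/* Everything a worker touches lives here, one instance per thread. */
typedef struct {
    uint32_t state;  /* private generator state, must be nonzero */
    uint64_t sum;    /* result written by the worker */
} worker_ctx;

/* Small illustrative generator that only touches caller-owned state. */
uint32_t xorshift32(uint32_t *state) {
    assert(*state != 0);  /* a zero state would stay zero forever */
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Contended variant: all threads draw from the process-wide rand() state,
 * so on libcs that lock it (assumed here) every call is serialized. */
void *worker_shared(void *arg) {
    worker_ctx *ctx = arg;
    uint64_t sum = 0;
    for (int i = 0; i < DRAWS_PER_THREAD; i++)
        sum += (uint64_t)rand();
    ctx->sum = sum;
    return NULL;
}

/* Uncontended variant: each thread advances only its own state. */
void *worker_local(void *arg) {
    worker_ctx *ctx = arg;
    uint64_t sum = 0;
    for (int i = 0; i < DRAWS_PER_THREAD; i++)
        sum += xorshift32(&ctx->state);
    ctx->sum = sum;
    return NULL;
}

/* Seeds each context, runs fn on NUM_THREADS threads, and joins them.
 * Returns 0 if all threads were started, -1 otherwise. */
int run_workers(void *(*fn)(void *), worker_ctx ctxs[NUM_THREADS]) {
    pthread_t threads[NUM_THREADS];
    int started = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        ctxs[i].state = (uint32_t)(i + 1);  /* distinct nonzero seed */
        ctxs[i].sum = 0;
        if (pthread_create(&threads[i], NULL, fn, &ctxs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return started == NUM_THREADS ? 0 : -1;
}
```

Timing run_workers(worker_shared, ctxs) against run_workers(worker_local, ctxs) on a machine with many cores should show the gap. How large it is depends on the libc and the hardware, so treat this as something to measure rather than a prediction.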