by dvas
891 days ago
This got me curious about the ARM latency. I wonder whether it came from particular instructions that add latency, or from transfers between the registers and the memory subsystem internals. Also, on the off chance that you remember: did you write the intrinsics inline yourself, or let the compiler auto-optimize? It would be interesting to test this on an ARM Mac and see whether different dependency chains show significant latency penalties or interact with the reorder buffer.
The best current info I could find for the latency advice is [2]. Quoting: "Moving data from NEON to ARM registers [in] Cortex-A8 is expensive." Looking at [3] partially reveals why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON registers are cheap, but a move in the reverse direction can cause a large pipeline stall. This is an unusual design decision that, as far as I know, does not hold for any other CPU. Edit: I found [4], which is a more authoritative source.
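To make the advice concrete, here is a rough sketch in C of the two patterns. It is not taken from the synthesizer code; the function names `sum_transfer_each_iter` and `sum_transfer_once` are made up for illustration. The first moves lanes from NEON to integer registers on every iteration (the expensive direction on Cortex-A8), while the second keeps the accumulator in a NEON register and does a single transfer at the end. A scalar fallback is included so it also builds on non-ARM targets, and an optimizing compiler may well rewrite the first loop, so treat this as an illustration of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: extracts every lane into an integer register inside
   the loop, so each iteration performs NEON -> ARM register moves. */
uint32_t sum_transfer_each_iter(const uint32_t *data, size_t n) {
    uint32_t total = 0;
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(data + i);
        total += vgetq_lane_u32(v, 0) + vgetq_lane_u32(v, 1)
               + vgetq_lane_u32(v, 2) + vgetq_lane_u32(v, 3);
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; i++) total += data[i];
    return total;
}

/* Hypothetical example: accumulates in a NEON register and moves the result
   to an integer register only once, after the loop. */
uint32_t sum_transfer_once(const uint32_t *data, size_t n) {
    uint32_t total = 0;
    size_t i = 0;
#if defined(__ARM_NEON)
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 4 <= n; i += 4)
        acc = vaddq_u32(acc, vld1q_u32(data + i));
    /* Horizontal add: fold 4 lanes to 2, then 2 lanes to 1. */
    uint32x2_t half = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    half = vpadd_u32(half, half);
    total = vget_lane_u32(half, 0); /* the single NEON -> ARM move */
#endif
    for (; i < n; i++) total += data[i];
    return total;
}
```

Both functions return the same sum (with the usual unsigned wraparound); only the number of NEON-to-integer transfers differs.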
[1]: https://github.com/google/music-synthesizer-for-android/blob...
[2]: https://community.arm.com/support-forums/f/armds-forum/757/n...
[3]: https://www.design-reuse.com/articles/11580/architecture-and...
[4]: https://developer.arm.com/documentation/den0018/a/Optimizing...