by raphlinus
891 days ago
This is for Cortex-A8, which was the chip in the Nexus One. I wrote the original version of sound synthesis directly in ARM assembler [1]. It was very highly optimized: I remember using a cycle-counting app that flagged any dependency chain that would cause the processor to stall, and ultimately utilization was in the 90%+ range. Back in those days, processors were simple enough that you could do this kind of optimization by hand. By the time of Cortex-A15 (Nexus 10 etc.), instruction issue was out-of-order and much harder to reason about.

The best current info I could find for the latency advice is [2]. Quoting, "Moving data from NEON to ARM registers is Cortex-A8 is expensive." Looking at [3] partially reveals the reason why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON are cheap, but the reverse direction can cause a large pipeline stall. This is an unusual design decision that, as far as I know, no other CPU shares.

Edit: I found [4], which is a more authoritative source.

[1]: https://github.com/google/music-synthesizer-for-android/blob...
[2]: https://community.arm.com/support-forums/f/armds-forum/757/n...
[3]: https://www.design-reuse.com/articles/11580/architecture-and...
[4]: https://developer.arm.com/documentation/den0018/a/Optimizing...
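
To make the NEON-to-ARM point concrete, here is a minimal sketch in C with NEON intrinsics. It is not taken from the synthesizer code in [1]; the function names `sum_transfer_every_iter` and `sum_transfer_once` are made up for illustration, and the scalar fallback exists only so the file also builds on non-ARM machines. Both functions compute the same sum. The first moves a value from a NEON register to an integer register on every iteration, which is the pattern the advice warns about on Cortex-A8. The second keeps the running total in a NEON register and does a single transfer at the end. A compiler may rearrange either loop, so treat this as a picture of the idea rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0 /* scalar fallback so the sketch builds anywhere */
#endif

/* Hypothetical example: one NEON -> integer transfer per iteration.
 * On Cortex-A8 each such transfer can stall the integer pipeline. */
uint32_t sum_transfer_every_iter(const uint32_t *x, size_t n) {
    assert(n % 4 == 0); /* sketch handles whole 4-lane vectors only */
    uint32_t total = 0;
    for (size_t i = 0; i < n; i += 4) {
#if HAVE_NEON
        uint32x4_t v = vld1q_u32(x + i);
        uint32x2_t h = vadd_u32(vget_low_u32(v), vget_high_u32(v));
        h = vpadd_u32(h, h);           /* both lanes now hold the 4-lane sum */
        total += vget_lane_u32(h, 0);  /* NEON -> ARM move inside the loop */
#else
        total += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
#endif
    }
    return total;
}

/* Hypothetical example: accumulate in NEON, transfer once at the end. */
uint32_t sum_transfer_once(const uint32_t *x, size_t n) {
    assert(n % 4 == 0);
#if HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t i = 0; i < n; i += 4) {
        acc = vaddq_u32(acc, vld1q_u32(x + i)); /* stays in NEON registers */
    }
    uint32x2_t h = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    h = vpadd_u32(h, h);
    return vget_lane_u32(h, 0);        /* single NEON -> ARM move */
#else
    uint32_t lanes[4] = {0, 0, 0, 0};  /* mimics four vector lanes */
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            lanes[k] += x[i + k];
        }
    }
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
}
```

Moves in the other direction (integer to NEON, e.g. `vdupq_n_u32`) are the cheap ones per the description above, so setting up constants from integer registers before the loop is fine.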
|
Going by [4] and the other links you posted, the Cortex-A8 behavior makes sense to me now: an instruction passing data between the register files fills up the pipeline and then stalls.
Will have a peek at the ARMv8/ARMv9 architectures and see what they did there regarding SVE/SVE2.