by TheTon
1387 days ago
The article matches my own experience porting some media processing code from SSE to NEON for the Apple Silicon transition. I had a library with C and SSE implementations, and I wanted to write a NEON implementation that would outperform the C version (on any arch) as well as the SSE version in all three of its forms: running native (on an Intel CPU), ported (via an intrinsics compatibility library), and runtime translated (via Rosetta 2). I started by studying the SSE code. This ended up not being useful, and was even counterproductive. I only began to make good progress when I let myself forget what I knew about the SSE implementation and used the C code as a starting point instead. By backing up to think about what the code was actually doing at a high level, and then about how best to express that in NEON, I came up with quite different approaches from the SSE code, and in the end the NEON version was much faster.
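
As a hypothetical illustration of that workflow (this is not the library described above, and the function names and grayscale weights are assumptions chosen for the example), consider converting interleaved RGB to grayscale. SSE has no deinterleaving load, so an SSE version would typically spend instructions on shuffles, and a line-by-line port would carry those over. Starting from the scalar C instead, NEON's structured load `vld3_u8` splits the channels directly. The sketch keeps the scalar routine as the reference and falls back to it when NEON is unavailable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: gray = (77*R + 150*G + 29*B) >> 8.
 * The weights are illustrative and sum to 256, so the result fits in 8 bits. */
static void rgb_to_gray_scalar(const uint8_t *rgb, uint8_t *gray, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
    }
}

/* Hypothetical entry point: NEON path when available, scalar otherwise. */
static void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t n)
{
    assert(n == 0 || (rgb != NULL && gray != NULL));
    size_t i = 0;

#if defined(__ARM_NEON)
    const uint8x8_t wr = vdup_n_u8(77);
    const uint8x8_t wg = vdup_n_u8(150);
    const uint8x8_t wb = vdup_n_u8(29);

    for (; i + 8 <= n; i += 8) {
        /* One load deinterleaves 8 pixels into R, G and B vectors. */
        uint8x8x3_t px = vld3_u8(rgb + 3 * i);

        /* Widening multiply-accumulate into 16-bit lanes (max 255*256). */
        uint16x8_t acc = vmull_u8(px.val[0], wr);
        acc = vmlal_u8(acc, px.val[1], wg);
        acc = vmlal_u8(acc, px.val[2], wb);

        /* Shift right by 8 and narrow back to 8-bit lanes. */
        vst1_u8(gray + i, vshrn_n_u16(acc, 8));
    }
#endif

    /* Remaining pixels (or all of them without NEON) use the reference. */
    rgb_to_gray_scalar(rgb + 3 * i, gray + i, n - i);
}
```

Comparing the output of `rgb_to_gray` against `rgb_to_gray_scalar` on the same buffer gives a simple correctness check while experimenting with different vector formulations.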