... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).
I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)
Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)
Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.
256B is quite large for sure, but what about two nibble lookups in a 16-byte table? Or a 2-bit swap and a 6-bit lookup in a 64B table? (Current x86-64 CPUs typically have 64B L1 cache lines?)
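For anyone who wants to see the 16-byte variant spelled out, here's a rough sketch. It assumes the operation under discussion is reversing the bits of a byte (this part of the thread never says so explicitly, but ARMv6T2 is where the RBIT instruction arrived). The names `rev4` and `reverse8_nibble` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical 16-entry table: rev4[n] is the 4-bit value n with its
 * bits reversed (e.g. 0x1 = 0001b -> 1000b = 0x8). */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse the bits of a byte with two nibble lookups: the reversed low
 * nibble becomes the new high nibble, and vice versa. */
static uint8_t reverse8_nibble(uint8_t x)
{
    return (uint8_t)((rev4[x & 0x0Fu] << 4) | rev4[x >> 4]);
}
```

The table fits in half of a 32-byte cache line, at the cost of two loads, a shift and an OR instead of one load.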
8 words (32 bytes) to an L1 line on the original iPhone ARM, so yes, a smaller table would fit in a single line. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least the I and D caches are separate.)
But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.
Why yes, yes I do miss fiddling around with low level details. :)
If you're talking to a dev with a firmware background, I'd say you'd be hard pressed to find anybody who'll put a single asm instruction into a separate function if it's speed-critical.
Sure, the compiler might (and probably will) inline, but it means any change in your tool chain or a whim of the inline heuristic can cause serious performance regressions that are completely avoidable.
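One way to take the inline heuristic out of the picture, sketched under a few assumptions: the compiler is GCC or Clang (the `always_inline` attribute is specific to them), and the operation is again byte bit reversal. `FORCE_INLINE` and `reverse8_inline` are hypothetical names, and the body is a portable shift-and-mask version standing in for the single instruction.

```c
#include <stdint.h>

/* Ask GCC/Clang to inline regardless of optimization heuristics.
 * On other compilers this falls back to a plain inline hint. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Table-free byte reversal: swap nibbles, then bit pairs, then
 * adjacent bits. */
FORCE_INLINE uint8_t reverse8_inline(uint8_t x)
{
    x = (uint8_t)(((x & 0xF0u) >> 4) | ((x & 0x0Fu) << 4));
    x = (uint8_t)(((x & 0xCCu) >> 2) | ((x & 0x33u) << 2));
    x = (uint8_t)(((x & 0xAAu) >> 1) | ((x & 0x55u) << 1));
    return x;
}
```

This doesn't settle the argument above: it trades a dependency on the heuristic for a dependency on a compiler extension, which is its own kind of toolchain risk.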
When it comes to performance, the embedded mindset will always be "belt and suspenders" ;)