Hacker News new | ask | show | jobs
by KaeseEs 4700 days ago
Great analysis, although I'm curious how the idea that doing a bunch of 64 bit ops in order to accomplish byte arithmetic came about to begin with - was the function in question not written by a firmware guy?
2 comments

I think this one goes back to PDP days and wasn't necessarily written to be the fastest possible implementation. The PDP could do 36*36 multiply into 72 bits. Not sure how the modulo instruction performed but there was a DIV instruction.
Down the rabbit hole says this came from HAKMEM No. 239 in 1972!

http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167

Or the programmer was a firmware hacker, and she knew that RBIT is ARMv6T2, which IIRC wasn't available until the iPhone 3GS. (Not 100% certain, and I don't have my manuals handy)
... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).
I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)

Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)

Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.
256B is quite large for sure, but what about going for 2 nibble lookups in a 16 byte table? Or a 2 bit swap and a 6 bit lookup in a 64B table? (current x86-64 CPUs typically have 64B L1 cache lines?)
8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)

But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.

Why yes, yes I do miss fiddling around with low level details. :)

... don't C++ people always tell us that inlining is the single easiest optimization for a compiler to perform?