| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by KaeseEs 4747 days ago
	Great analysis, although I'm curious how the idea that doing a bunch of 64 bit ops in order to accomplish byte arithmetic came about to begin with - was the function in question not written by a firmware guy?

2 comments

zwieback 4747 days ago

I think this one goes back to PDP days and wasn't necessarily written to be the fastest possible implementation. The PDP could do 36*36 multiply into 72 bits. Not sure how the modulo instruction performed but there was a DIV instruction.

link

binarymax 4747 days ago

Down the rabbit hole says this came from HAKMEM No. 239 in 1972!

http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167

link

groby_b 4747 days ago

Or the programmer was a firmware hacker, and she knew that RBIT is ARMv6T2, which IIRC wasn't available until the iPhone 3GS. (Not 100% certain, and I don't have my manuals handy)

link

stephencanon 4747 days ago

... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).

link

groby_b 4747 days ago

I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)

Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)

link

caf 4747 days ago

Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.

link

nly 4747 days ago

256B is quite large for sure, but what about going for 2 nibble lookups in a 16 byte table? Or a 2 bit swap and a 6 bit lookup in a 64B table? (current x86-64 CPUs typically have 64B L1 cache lines?)

link

groby_b 4747 days ago

8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)

But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.

Why yes, yes I do miss fiddling around with low level details. :)

link

saalweachter 4747 days ago

... don't C++ people always tell us that inlining is the single easiest optimization for a compiler to perform?

link