


	by stephencanon 4747 days ago
	... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).


groby_b 4747 days ago

I am. 256 bytes pain me :) (I've worked on systems with that much RAM total)

Kidding aside, I'd probably not go for the lookup table unless the whole thing was necessary in an inner loop - the cache miss cost is high. And since it's in a function, it better not be in an inner loop :)

caf 4747 days ago

Surely there is an argument that if it's called infrequently enough that cache misses on the LUT are a problem, then it's also called so infrequently that its performance is irrelevant.

nly 4747 days ago

256B is quite large for sure, but what about going for 2 nibble lookups in a 16 byte table? Or a 2 bit swap and a 6 bit lookup in a 64B table? (current x86-64 CPUs typically have 64B L1 cache lines?)
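A minimal sketch of the two-nibble idea, assuming the function under discussion reverses the bits of a byte (the excerpt never says so outright; the ARMv6T2 remark suggests it). The names `rev_nibble` and `reverse8_nibble` are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Entry i holds i with its 4 bits mirrored (e.g. 0x1 -> 0x8). 16 bytes total. */
static const uint8_t rev_nibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte with two lookups: the mirrored low nibble becomes the
   high nibble of the result, and the mirrored high nibble becomes the low. */
static inline uint8_t reverse8_nibble(uint8_t b)
{
    return (uint8_t)((rev_nibble[b & 0x0F] << 4) | rev_nibble[b >> 4]);
}
```

The table fits in a single 64B cache line, at the cost of a second lookup plus a shift and an OR per byte.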

groby_b 4747 days ago

8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)

But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.

Why yes, yes I do miss fiddling around with low level details. :)

saalweachter 4747 days ago

... don't C++ people always tell us that inlining is the single easiest optimization for a compiler to perform?

groby_b 4747 days ago

Yeah. Sure. Especially across libraries.

If you're talking to a dev with a background in firmware dev, I'd say you'd be hard pressed to find anybody who'll put a single asm instruction into a separate function if it's speed-critical.

Sure, the compiler might (and probably will) inline, but it means any change in your tool chain or a whim of the inline heuristic can cause serious performance regressions that are completely avoidable.

When it comes to performance, the embedded mindset will always be "belt and suspenders" ;)
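One way to get that belt-and-suspenders behaviour without writing the code out by hand at every call site is to ask for inlining explicitly. The sketch below assumes GCC or Clang (for `__attribute__((always_inline))`) and again assumes the operation is a byte bit-reversal. `FORCE_INLINE` and `reverse8` are hypothetical names, and the body is a table-free swap sequence rather than anything taken from the thread.

```c
#include <stdint.h>

/* GCC and Clang accept always_inline; other compilers fall back to a plain
   inline hint, which the optimizer is free to ignore. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Reverse the bits of a byte without a table: swap the nibbles, then the
   bit pairs within each nibble, then the adjacent bits within each pair. */
FORCE_INLINE uint8_t reverse8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}
```

This only helps when the definition is visible in the calling translation unit, for example in a header. It does nothing for the cross-library case mentioned above.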
