Hacker News new | ask | show | jobs
by wholesomepotato 924 days ago
I think you're right. Even better. Did you have time to bench it, etc.?

I see what you mean by length. I just skimmed over the text originally as I don't have time for rather lame problems like this. I'd just add 3 bits of length to be part of the index, job done. 12KB lookup table instead of 4KB, assuming 0 is not a valid value (negate to avoid needing 0b11).

1 comments

Adding the length means another shift+or operation at minimum. I already think this is slower than the technique presented in the article, and this would make it worse.

It's an interesting idea, but I don't see it being practical, even if the size of the table wasn't an issue.

Instruction level parallelism will make extra shift free. Other than this it needs to be benched and might depend on cpu/arch. I don't care enough to bench and optimize further.
It's most definitely not free. It'd consume fetch bandwidth, decode/rename/scheduler slots, an execution port etc.

The comparison here is:

((v ^ 0x303030) * 0x640a0100) >> (len << 3)

against:

table[(((v >> 12) | v) & 0xfff) | (len << 12)]

The former is 4 ops, the latter is 6 ops, so throughput wise, the former wins. Latency wise, it also wins, considering that L1 cache lookups are generally 3-5 cycles, whilst integer multiply is typically 3-4.