Hacker News new | ask | show | jobs
by RaisingSpear 925 days ago
Adding the length means another shift+or operation at minimum. I already think this is slower than the technique presented in the article, and this would make it worse.

It's an interesting idea, but I don't see it being practical, even if the size of the table wasn't an issue.

1 comments

Instruction level parallelism will make extra shift free. Other than this it needs to be benched and might depend on cpu/arch. I don't care enough to bench and optimize further.
It's most definitely not free. It'd consume fetch bandwidth, decode/rename/scheduler slots, an execution port etc.

The comparison here is:

((v ^ 0x303030) * 0x640a0100) >> (len << 3)

against:

table[(((v >> 12) | v) & 0xfff) | (len << 12)]

The former is 4 ops, the latter is 6 ops, so throughput wise, the former wins. Latency wise, it also wins, considering that L1 cache lookups are generally 3-5 cycles, whilst integer multiply is typically 3-4.