|
|
|
|
|
by glangdale
2956 days ago
|
|
By the unwritten conventions of bit hacks folks, 'builtin_clz' is not a primitive (and elsewhere in that same document you will find a brace of different ways of doing clz/ctz). Many of these documents are old and the techniques in them older still (dating back to HAKMEM). You are definitely correct on most modern architectures which have a single operation count leading/trailing zeros that are typically no more expensive than an integer multiply. This has been latency = 3, throughput = 1 for most recent Intel arch. I think a redo assuming modern architectures is long overdue; i.e. let's work off the basis that count leading/trailing zeros, PDEP/PEXT, etc are fast and cheap and start from there. Bonus round: if you had a lot of these operations to do, then you still might be able to go faster on x86 on SSE2 or AVX2 by using this trick, as the penalty to pass a SIMD register full of integers to a floating point op or vice versa is just one cycle. This trick becomes obsolete yet again with AVX512, which introduces parallel count leading zeros. |
|
Also multiply adds can be fused so if you're doing that it cab be faster to just multiply by a different number instead of bit twiddling.