Instruction-level parallelism will make the extra shift free. Beyond that, it needs to be benchmarked and might depend on the CPU/architecture. I don't care enough to benchmark and optimize further.
It's most definitely not free. It still consumes fetch bandwidth, decode/rename/scheduler slots, an execution port, etc.
The comparison here is:
((v ^ 0x303030) * 0x640a0100) >> (len << 3)
against:
table[(((v >> 12) | v) & 0xfff) | (len << 12)]
The former is 4 ops (an XOR, a multiply and two shifts), the latter is 6 ops (two shifts, two ORs, an AND and a load), so throughput-wise, the former wins. Latency-wise, it also wins, considering that an L1 cache hit generally costs 3-5 cycles, whilst an integer multiply is typically 3-4.
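For anyone who wants to poke at the multiply version, here is a minimal sketch in C. It rests on assumptions that are not stated in the thread: that v holds up to three ASCII digits with the first digit in the lowest byte, that len is the digit count (1 to 3), that the arithmetic is 32-bit, and that the result is truncated to 8 bits. The names octet_from_digits and pack3 are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: pack three bytes into the low 24 bits of a word,
   first character in the lowest byte. Needs at least 3 readable bytes. */
static uint32_t pack3(const char *s) {
    return (uint32_t)(unsigned char)s[0]
         | ((uint32_t)(unsigned char)s[1] << 8)
         | ((uint32_t)(unsigned char)s[2] << 16);
}

/* The multiply variant from the comparison above.
   Assumption: bytes past the first len digits may hold anything;
   only the first len bytes need to be ASCII digits. */
static uint8_t octet_from_digits(uint32_t v, unsigned len) {
    /* XOR turns ASCII '0'..'9' into 0..9 in each byte. */
    uint32_t digits = v ^ 0x303030u;
    /* 0x640a0100 = 100<<24 | 10<<16 | 1<<8, so after the multiply:
       byte 1 = d0, byte 2 = d0*10 + d1, byte 3 = d0*100 + d1*10 + d2
       (byte 3 is taken modulo 256 because the product wraps at 32 bits). */
    uint32_t product = digits * 0x640a0100u;
    /* Shift by 8, 16 or 24 to pick the byte matching the digit count,
       then truncate to 8 bits. */
    return (uint8_t)(product >> (len << 3));
}
```

Under those assumptions, octet_from_digits(pack3("42."), 2) evaluates to 42, and the '.' in the third byte does not affect the result. A three-digit input above 255 wraps modulo 256, so range validation would have to happen elsewhere.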