For the 8 bits case in the first benchmark, why isn't `Vec<u8>` just as fast as the rest? Is it due to the compiler emitting poorer instructions in that case or doing extra checks the other implementations don't?
It's not about poorer instructions; a get_unchecked on a Vec<u8> is just a single memory access, which is as good as it gets. The difference is likely down to cache locality effects created by the benchmark loop itself.
The benchmark does a million random reads. For the FixedVec implementation with bit_width=8, the underlying storage is a Vec<u64>. This means the data is 8x more compact than the Vec<u8> baseline for the same number of elements.
When the random access indices are generated, they are more likely to fall within the same cache lines for the Vec<u64>-backed structure than for the Vec<u8>. Even though Vec<u8> has a simpler access instruction, it suffers more cache misses across the entire benchmark run.
The benchmark does a million random reads. For the FixedVec implementation with bit_width=8, the underlying storage is a Vec<u64>. This means the data is 8x more compact than the Vec<u8> baseline for the same number of elements.
When the random access indices are generated, they are more likely to fall within the same cache lines for the Vec<u64>-backed structure than for the Vec<u8>. Even though Vec<u8> has a simpler access instruction, it suffers more cache misses across the entire benchmark run.