I don't use lookup tables that much when optimizing these days, because most math instructions cost far fewer cycles than a cache miss. And even if your lookup table is in L3, it still costs ~40 cycles to retrieve each line.
That's a good instinct to have. However, in this case the input is 0..63 and the output fits comfortably in a single byte, so the whole table is 64 bytes. That fits in a single 64-byte cache line if you bother to align it.
The miss cost is therefore not really relevant: if this code is on the hot path at all, the cost of a single cache miss is amortized across millions of calls. Your lookup table will be in cache as surely as the code that reads it is.
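For what it's worth, here is roughly what I mean by aligning it, as a minimal C11 sketch. The names `slow_f`, `lut_init` and `lut_lookup` are made up for illustration, `slow_f` is just a placeholder for whatever math you are replacing, and the 64-byte line size is an assumption that holds on most current x86 and ARM parts but is worth checking for your target.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignas */
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check your target. */
#define CACHE_LINE 64

/* Hypothetical stand-in for the computation being replaced.
 * For x in 0..63 the result is at most (63 * 63) >> 4 = 248,
 * so it fits in one byte. */
static uint8_t slow_f(unsigned x)
{
    return (uint8_t)((x * x) >> 4);
}

/* 64 one-byte entries, aligned so the table starts on a line
 * boundary and therefore never straddles two lines. */
static alignas(CACHE_LINE) uint8_t lut[64];

static_assert(sizeof lut == CACHE_LINE, "table must fill exactly one line");

/* Fill the table once at startup. */
static void lut_init(void)
{
    for (unsigned x = 0; x < 64; ++x)
        lut[x] = slow_f(x);
}

/* Hot-path lookup. The mask keeps the index in 0..63 even if the
 * caller passes something larger. */
static inline uint8_t lut_lookup(unsigned x)
{
    return lut[x & 63u];
}
```

Call `lut_init()` once before the hot loop, then use `lut_lookup(x)` wherever the original computation was. Without the `alignas`, a 64-byte table can land across a line boundary and occupy two lines instead of one, which is the only thing the alignment buys you here.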