| HN Mirror


	by pjz 990 days ago
	Here's an in-the-wild, long-standing example of a table used for bit reversals: https://gitlab.com/libtiff/libtiff/-/blob/master/libtiff/tif... The function to do the same for a byte is... what? 3 or 4 lines long in C? I guess it takes a couple of temp variables, though, which may dissuade some devs.
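For reference, a minimal sketch of the table-free byte reversal the post alludes to. The function name `reverse_byte` is made up for illustration and is not taken from libtiff; this variant overwrites its argument in place rather than using separate temporaries.

```c
#include <stdint.h>

/* Reverse the bit order of one byte with three swap steps and no table. */
static inline uint8_t reverse_byte(uint8_t b)
{
    b = (uint8_t)((b & 0xF0) >> 4 | (b & 0x0F) << 4); /* swap the two nibbles */
    b = (uint8_t)((b & 0xCC) >> 2 | (b & 0x33) << 2); /* swap bit pairs within each nibble */
    b = (uint8_t)((b & 0xAA) >> 1 | (b & 0x55) << 1); /* swap adjacent bits */
    return b;
}
```

Applying the function twice returns the original byte, which makes for a cheap sanity check.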

4 comments

a1369209993 990 days ago

That's not "A lookup table is faster than a native bit reversal instruction."; that's "There's no portable way to make the compiler emit a native bit reversal instruction, and a lookup table is (presumably) faster than a couple dozen instructions to do it manually (at least on most target architectures).".

jeffbee 990 days ago

What about something like https://github.com/abseil/abseil-cpp/blob/master/absl/string... ? ASCII lowercase can be something like

     auto is_upper= (x > '@') & (x < '[');
     return is_upper ? x+32 : x;

Which boils down to add, add, cmp, cmov on x86. The code version might save some cache traffic, but the lookup-table version is less code, since it's only a mov. The version that's code instead of lookups probably exacerbates register pressure problems in large programs? I don't have good intuition for which would be more efficient in practice.
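To make the comparison concrete, here is a rough sketch of both variants side by side. The names `lower_arith`, `lower_table`, `init_lower_table` and `lower_lut` are hypothetical and not Abseil's; an ASCII execution character set is assumed, and the table here is filled at startup rather than written out as a static initializer.

```c
#include <stdint.h>

/* Arithmetic version, following the snippet above: two compares and a select. */
static inline uint8_t lower_arith(uint8_t x)
{
    int is_upper = (x > '@') & (x < '[');
    return is_upper ? (uint8_t)(x + 32) : x;
}

/* Table version: a single indexed load per character once the table is filled. */
static uint8_t lower_table[256];

static void init_lower_table(void)
{
    for (int i = 0; i < 256; i++)
        lower_table[i] = lower_arith((uint8_t)i);
}

static inline uint8_t lower_lut(uint8_t x)
{
    return lower_table[x];
}
```

Both return the same result for every byte value; the difference is whether the work is a few ALU instructions or a 256-byte table that has to stay in cache.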

dragontamer 990 days ago

Your 8-bit lookup can never be parallelized, while the add/cmp/cmov is really easy AVX512 that probably auto-inlines and auto-vectorizes.

I dunno, my instinct says the code version is a ridiculous win, while the lookup table looks pretty bad.

jeffbee 990 days ago

> can never be parallelized

I mean, I'm not skilled enough in those ISA extensions to stick my neck out. It's not totally obvious to me that there is not some shuffle or permute facility that can load 64 bytes at a time from LUTs.

dragontamer 990 days ago

> It's not totally obvious to me that there is not some shuffle or permute facility that can load 64 bytes at a time from LUTs.

You are talking about vgather and/or vscatter, which are well known to be very slow AVX2 or AVX512 instructions.

Maybe a future CPU will make these instructions high performance. But no modern 2023-era CPU has a high-speed vgather.

Like: the vgather loads its elements one at a time, slowly. You basically lose parallelism: even if the vgather instruction describes what you want to do, it's not an effective parallel operation today (and may never be).

--------

Pshufb as a 4-bit LUT is an exception and is effectively a high-speed (but very, very small) lookup table. Like, 64-bytes-at-a-time-every-cycle fast.

You are limited by the 16-byte lookup size (aka the size of an SSE register), maybe a bit bigger if there are new instructions I dunno about.

gpderetta 990 days ago

Vectorized gather loads can do just that. But currently are pretty underwhelming on x86 machines.

gpderetta 990 days ago

Although it can't be easily vectorized, it can be pipelined, which can help both hide the access latency and keep the lookup table hot. The code is not going to be pretty, though.

vardump 990 days ago

> Which boils down to add, add, cmp, cmov on x86.

By using SIMD instructions like AVX2 you can do 32 characters in parallel. With AVX512, 64.

Pretty easy to write something way faster by not using a LUT.
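As a portable stand-in for the AVX2/AVX512 version (this is not AVX code), here is a sketch of the same table-free idea using SWAR ("SIMD within a register"), which handles 8 characters per 64-bit word in plain C. The names `lower8_swar` and `lower_buf` are hypothetical; a real SIMD version would presumably use vector compares and a blend or masked add instead of the carry trick used here.

```c
#include <stdint.h>
#include <string.h>

/* Lowercase eight ASCII bytes packed in one 64-bit word, with no table.
   Bytes with the high bit set (non-ASCII) pass through unchanged. */
static inline uint64_t lower8_swar(uint64_t w)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t high = 0x8080808080808080ULL;
    uint64_t c = w & low7;                      /* clear bit 7 so adds cannot carry across bytes */
    uint64_t ge_A = c + 0x3F3F3F3F3F3F3F3FULL;  /* bit 7 set in lanes where c >= 'A' */
    uint64_t gt_Z = c + 0x2525252525252525ULL;  /* bit 7 set in lanes where c >  'Z' */
    uint64_t upper = (ge_A ^ gt_Z) & ~w & high; /* 0x80 in each ASCII uppercase lane */
    return w | (upper >> 2);                    /* 0x80 >> 2 == 0x20, i.e. add 32 */
}

/* Hypothetical helper: lowercase a buffer eight bytes at a time. */
static void lower_buf(unsigned char *s, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* unaligned-safe load */
        w = lower8_swar(w);
        memcpy(s + i, &w, 8);
    }
    for (; i < n; i++)          /* scalar tail for the last n % 8 bytes */
        if (s[i] > '@' && s[i] < '[')
            s[i] += 32;
}
```

Because every lane is processed independently, the result does not depend on byte order, so the same code works on little- and big-endian machines.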

nwellnhof 990 days ago

Yes, this is a good example where LUTs are very likely to be slower, especially if you can avoid branch mispredictions with cmov.

tgv 990 days ago

Except that in Unicode, this doesn't work.

dragontamer 990 days ago

The example code is a 256-byte lookup table, meaning the original example won't work with Unicode either.

In fact, I'm pretty sure that Unicode cannot be solved with a lookup table due to its variable length, but maybe you can prove me wrong? (Assuming UTF8 here)

tgv 990 days ago

256 is indeed pretty limited.

Unicode does have a limited space, but it cannot be stored practically in a single table. It currently runs up to 0x323AF, a bit over 200k, and most of the characters of course don't have a lower/uppercase mapping. The implementations I've seen do a few comparisons and then delegate to a table.

But: horses for courses. If you have to normalize a lot of Latin-1 text (such as transforming case or stripping diacritics), you can probably write some vector instructions that run circles around a simple LUT. But it's not going to be as easy.

ilyt 989 days ago

What the fuck you expect "bit reversal" to even do in unicode ?

tgv 989 days ago

I had the vague impression this was about lowercase. No need for expletives. Perhaps you should comment elsewhere.

robocat 990 days ago

Comedy: did you notice that the next static definition is for a 256 byte lookup table for the unreversed bits too:

  TIFFNoBitRevTable[256] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, ... 0xFF }

Maybe PSHUFB/VTBL to reverse each nibble and then put the reversed nibbles together would be wayyy faster?
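A scalar sketch of that nibble idea, for illustration only: a 16-entry table reverses each nibble, and the two results are swapped into place. PSHUFB/VTBL would perform the same 16-entry lookup across a whole vector of bytes at once; the names `nibble_rev` and `reverse_byte_nibbles` are made up here and nothing in this sketch comes from libtiff.

```c
#include <stdint.h>

/* nibble_rev[n] is the 4-bit value n with its bit order reversed. */
static const uint8_t nibble_rev[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte: the reversed low nibble becomes the high nibble and vice versa. */
static inline uint8_t reverse_byte_nibbles(uint8_t b)
{
    return (uint8_t)(nibble_rev[b & 0x0F] << 4 | nibble_rev[b >> 4]);
}
```

The table shrinks from 256 bytes to 16, which is exactly the size that fits in one SSE register.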

I'm guessing portability matters here, and there is not much commercial pressure to actually optimise this TIFF library?

ilyt 989 days ago

Which is not a simple add or xor, which is why a lookup is used... I obviously wasn't talking about it.

The most used one is probably sin() on CPUs without hardware math.