
by eesmith 945 days ago

> takes twice as many instructions

What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?

Have you measured the performance difference? https://johnnylee-sde.github.io/Fast-unsigned-integer-to-hex... shows a branchless UlongToHexString which is essentially as fast as a lookup table and faster than the "naive" implementation.
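To make the comparison concrete, here is a minimal sketch of a branchless nibble-to-hex conversion next to a lookup table. It is not the code from the linked article; the names hex_digit_branchless, hex_digit_table and u64_to_hex are made up for illustration, and it assumes ASCII and lowercase output.

```c
#include <stdint.h>

/* Hypothetical helper: map 0..15 to '0'..'9','a'..'f' without a branch.
   (n + 6) >> 4 is 1 exactly when n >= 10, which adds the gap between
   '9' + 1 and 'a' (39 in ASCII). */
char hex_digit_branchless(unsigned n)
{
    return (char)('0' + n + ((n + 6) >> 4) * ('a' - '0' - 10));
}

/* The lookup-table version, for comparison. */
char hex_digit_table(unsigned n)
{
    return "0123456789abcdef"[n & 15u];
}

/* Write the 16 hex digits of v, most significant first, plus a NUL. */
void u64_to_hex(uint64_t v, char out[17])
{
    for (int i = 0; i < 16; i++) {
        unsigned nibble = (unsigned)(v >> (60 - 4 * i)) & 15u;
        out[i] = hex_digit_branchless(nibble);
    }
    out[16] = '\0';
}
```

Which of the two is faster depends on the CPU, the compiler, and the surrounding code, so it has to be measured rather than assumed.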

> Bounds-checking for the English alphabet

In the following, it goes from two assembly instructions to three:

  int is_letter(char c) {
    c |= 0x20;  // normalize to lowercase
    return ('a' <= c) && (c <= 'z');
  }

Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character.

But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .

And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, e.g., 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean 50% more work.
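As a rough sketch of what the SWAR version could look like, the following applies the same "or with 0x20, then range-check" idea to 8 bytes at once. The name all_letters8 is hypothetical, it assumes ASCII, and the caller must supply at least 8 readable bytes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all 8 bytes at p are ASCII letters. */
int all_letters8(const char *p)
{
    const uint64_t HIGH = UINT64_C(0x8080808080808080);
    uint64_t w;
    memcpy(&w, p, sizeof w);           /* unaligned-safe load of 8 chars */

    /* Drop each byte's high bit, then set 0x20 to fold upper to lower case. */
    uint64_t lower = (w & ~HIGH) | UINT64_C(0x2020202020202020);

    /* Every byte is now <= 0x7F, so these adds cannot carry between bytes.
       The high bit of each byte becomes the comparison result. */
    uint64_t ge_a = lower + UINT64_C(0x1F1F1F1F1F1F1F1F);  /* byte >= 'a' */
    uint64_t gt_z = lower + UINT64_C(0x0505050505050505);  /* byte >  'z' */

    /* A letter is >= 'a', not > 'z', and was ASCII (high bit clear) to begin with. */
    uint64_t letters = ge_a & ~gt_z & ~w & HIGH;
    return letters == HIGH;
}
```

That is a handful of single-cycle operations for 8 characters instead of 8 separate range checks, and the extra or is paid once per word, not once per character.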

> like front and back braces/(angle)brackets/parens not being convertible

I have never needed that operation. Why do you need it?

Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.

> and saved a few instructions in common parsing loops.

Parsing needs to know what specific character comes next, and parsers are very rarely limited to only those characters. The ones I've looked at use a DFA, e.g., via a switch statement or lookup table.
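For illustration, here is a minimal sketch of that kind of switch-based DFA, scanning a few C-like operators with longest match. The token names and scan_op are invented for this example and are not taken from any real parser.

```c
#include <stddef.h>

enum tok { TOK_NONE, TOK_GT, TOK_GE, TOK_SHR, TOK_SHR_ASSIGN, TOK_MINUS, TOK_ARROW };

/* Hypothetical scanner: recognize the longest operator at the start of the
   NUL-terminated string s and store the number of characters used in *len. */
enum tok scan_op(const char *s, size_t *len)
{
    enum { START, SEEN_GT, SEEN_GT_GT, SEEN_MINUS } state = START;
    size_t i = 0;

    for (;;) {
        char c = s[i];
        switch (state) {
        case START:
            switch (c) {
            case '>': state = SEEN_GT;    break;
            case '-': state = SEEN_MINUS; break;
            default:  *len = 0; return TOK_NONE;
            }
            break;
        case SEEN_GT:
            switch (c) {
            case '>': state = SEEN_GT_GT; break;
            case '=': *len = i + 1; return TOK_GE;
            default:  *len = i;     return TOK_GT;
            }
            break;
        case SEEN_GT_GT:
            if (c == '=') { *len = i + 1; return TOK_SHR_ASSIGN; }
            *len = i;
            return TOK_SHR;
        case SEEN_MINUS:
            if (c == '>') { *len = i + 1; return TOK_ARROW; }
            *len = i;
            return TOK_MINUS;
        }
        i++;   /* only reached when a state transition consumed c */
    }
}
```

Every transition dispatches on the exact character, so where the bracket characters sit relative to each other in the code table does not save anything here.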

I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.

Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.