by t-3 945 days ago
I never questioned the competence of past engineers; I question the use of backwards compatibility.

Hardware has advanced, but software depends on standards and conventions formulated for far less capable hardware, and that's a problem.

The efficiency of string processing/generation is hugely important in terms of global energy consumption.

A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII were optimized for computability.
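For reference, a minimal sketch of the straightforward per-nibble conversion, assuming lowercase output. The function names are illustrative, not from any library. The extra work comes from '9' and 'a' not being adjacent in ASCII, so the values 10..15 need a different offset than 0..9:

```c
#include <stdint.h>

/* Convert the low 4 bits of v to a lowercase hex digit.
   In ASCII, '9' (0x39) and 'a' (0x61) are not adjacent, so the values
   10..15 need a different offset than 0..9. */
char nibble_to_hex(unsigned v) {
    v &= 0xF;
    return (char)(v < 10 ? '0' + v : 'a' + (v - 10));
}

/* Write the 8 hex digits of x, plus a terminating NUL, into out[9]. */
void u32_to_hex(uint32_t x, char out[9]) {
    for (int i = 0; i < 8; i++) {
        out[i] = nibble_to_hex(x >> (28 - 4 * i));  /* most significant nibble first */
    }
    out[8] = '\0';
}
```

How many instructions this becomes depends on the compiler and target, so the sketch only shows where the conditional comes from.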

Bounds-checking for the English alphabet requires either an upfront normalization or twice the checking, so 50-100% more instructions for that.

There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is.

[({< <-> >})] would have been just as or more useful than the alphabet being convertible and saved a few instructions in common parsing loops.
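To make the inconsistency concrete, here is a small sketch with hypothetical helper names. In ASCII, a letter and its other-case form differ in exactly one bit (0x20), while '(' and ')' differ by 1 and the other three pairs differ by 2, so no single constant converts all four openers to their closers:

```c
/* Return the matching closer for an ASCII opener, or 0 if c is not one.
   '(' -> ')' is +1, while '[', '{' and '<' reach their closers with +2,
   so no single offset or bit flip covers all four pairs. */
char closing_bracket(char c) {
    switch (c) {
    case '(': return ')';   /* 0x28 -> 0x29 */
    case '[': return ']';   /* 0x5B -> 0x5D */
    case '{': return '}';   /* 0x7B -> 0x7D */
    case '<': return '>';   /* 0x3C -> 0x3E */
    default:  return 0;
    }
}

/* Letters, by contrast, change case by flipping one bit.
   Only meaningful for A-Z and a-z. */
char flip_case(char c) {
    return (char)(c ^ 0x20);
}
```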

1 comment

> takes twice as many instructions

What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?

Have you measured the performance difference? https://johnnylee-sde.github.io/Fast-unsigned-integer-to-hex... shows a branchless UlongToHexString which is essentially as fast as a lookup table and faster than the "naive" implementation.

> Bounds-checking for the English alphabet

In the following it goes from two assembly instructions to three:

  int is_letter(char c) {
    c |= 0x20;  // normalize to lowercase
    return ('a' <= c) && (c <= 'z');
  }
Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character.

But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .

And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, e.g., 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean 50% more work.
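A rough sketch of the "8 characters or'ed at once" idea, with a made-up function name. It covers only the normalization step, so a range check on the result is still needed:

```c
#include <stdint.h>
#include <string.h>

/* OR 0x20 into each of 8 bytes with a single 64-bit operation.
   This maps A-Z onto a-z but also changes some other bytes
   (e.g. '@' becomes '`'), so it does not replace the range check. */
void or_0x20_x8(char chars[8]) {
    uint64_t word;
    memcpy(&word, chars, sizeof word);      /* avoids alignment and aliasing issues */
    word |= UINT64_C(0x2020202020202020);   /* same constant in every byte lane */
    memcpy(chars, &word, sizeof word);
}
```

Because the same constant goes into every byte lane, the result does not depend on byte order.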

> like front and back braces/(angle)brackets/parens not being convertible

I have never needed that operation. Why do you need it?

Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.

> and saved a few instructions in common parsing loops.

A parser needs to know which specific character comes next, and the input is very rarely limited to only those characters. The ones I've looked at use a DFA, e.g., via a switch statement or lookup table.
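As an illustration of the switch-statement style, here is a toy balance checker, not taken from any real parser. The function name and the fixed depth limit are assumptions made for this sketch. Each opener's case already knows which closer to expect, so nothing is computed from the character code:

```c
#include <stddef.h>

/* Return 1 if every (, [ and { in s is closed by the matching character
   in the right order, else 0. All other characters are ignored. */
int brackets_balanced(const char *s) {
    char stack[64];     /* expected closers; the depth limit is arbitrary */
    size_t depth = 0;
    for (; *s; s++) {
        char expect = 0;
        switch (*s) {
        case '(': expect = ')'; break;
        case '[': expect = ']'; break;
        case '{': expect = '}'; break;
        case ')': case ']': case '}':
            /* A closer must match the most recently pushed expectation. */
            if (depth == 0 || stack[--depth] != *s) return 0;
            break;
        default:
            break;
        }
        if (expect) {
            if (depth == sizeof stack) return 0;  /* too deep for this sketch */
            stack[depth++] = expect;
        }
    }
    return depth == 0;
}
```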

I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.

Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.