by t-3
945 days ago
I never questioned the competence of past engineers; I question the use of backwards compatibility. Hardware has advanced, but software depends on standards and conventions formulated for far less capable hardware, and that's a problem. The efficiency of string processing/generation is hugely important in terms of global energy consumption. A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII were optimized for computability. Bounds-checking for the English alphabet requires either an upfront normalization or twice the checking, so 50-100% more instructions for that. There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is. [({< <-> >})] would have been just as useful as, or more useful than, the alphabet being convertible, and would have saved a few instructions in common parsing loops.
What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?
Have you measured the performance difference? https://johnnylee-sde.github.io/Fast-unsigned-integer-to-hex... shows a branchless UlongToHexString which is essentially as fast as a lookup table and faster than the "naive" implementation.
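For a sense of what is being compared, here is a minimal sketch in C of a table-based conversion next to one with no per-digit branch. This is not the code from the linked article; the function names are hypothetical and the 32-bit width is an assumption made to keep it short.

```c
#include <stdint.h>

/* Table lookup: one small load per nibble. */
static void u32_to_hex_table(uint32_t v, char out[9]) {
    static const char digits[] = "0123456789abcdef";
    for (int i = 7; i >= 0; i--) {
        out[i] = digits[v & 0xF];
        v >>= 4;
    }
    out[8] = '\0';
}

/* Arithmetic only: (n + 6) >> 4 is 1 exactly when n >= 10, and
   39 == 'a' - '0' - 10 is the extra offset needed to reach 'a' in ASCII. */
static void u32_to_hex_arith(uint32_t v, char out[9]) {
    for (int i = 7; i >= 0; i--) {
        unsigned n = v & 0xF;
        out[i] = (char)('0' + n + ((n + 6) >> 4) * 39);
        v >>= 4;
    }
    out[8] = '\0';
}
```

Whether either one is actually faster depends on the compiler and CPU, which is why measuring matters.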
> Bounds-checking for the English alphabet
In the following it goes from 2 assembly instructions to three:
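As a minimal sketch of the kind of check being discussed (an illustration in C with hypothetical function names, not necessarily the snippet this comment originally showed):

```c
#include <stdbool.h>

/* One case only: subtract, then a single unsigned compare. */
static bool is_lower_ascii(unsigned char c) {
    return (unsigned char)(c - 'a') < 26;
}

/* Either case: OR in 0x20 first, which maps 'A'..'Z' onto 'a'..'z',
   then do the same subtract-and-compare. */
static bool is_alpha_ascii(unsigned char c) {
    return (unsigned char)((c | 0x20) - 'a') < 26;
}
```

The exact instruction count after compilation is an assumption that depends on the compiler and target; the point is only that the second version adds one bitwise OR.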
Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character. But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .
And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, e.g., 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean 50% more work.
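A minimal SWAR sketch of that "or 8 characters at once" idea, in C. The function name is hypothetical, and it assumes the keyword is exactly 8 lowercase ASCII letters, so that OR-ing 0x20 into every input byte is an exact case fold for the comparison:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compare 8 input bytes against an 8-letter lowercase keyword, ignoring
   ASCII case. One 64-bit OR sets bit 0x20 in all 8 bytes at once. */
static bool eq8_ignore_case(const char *s, const char *lower_kw) {
    uint64_t a, b;
    memcpy(&a, s, 8);        /* memcpy avoids unaligned-access issues */
    memcpy(&b, lower_kw, 8);
    return (a | UINT64_C(0x2020202020202020)) == b;
}
```

Both words are loaded the same way, so byte order does not affect the result. The caller has to guarantee at least 8 readable bytes at each pointer.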
> like front and back braces/(angle)brackets/parens not being convertible
I have never needed that operation. Why do you need it?
Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.
> and saved a few instructions in common parsing loops.
Parsing needs to know what specific character comes next, and the input is very rarely limited to only those characters. The parsers I've looked at use a DFA, e.g., via a switch statement or lookup table.
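To illustrate the lookup-table style, here is a minimal sketch in C of a bracket matcher where a table maps each opener to the closer it expects, so no arithmetic relation between the two characters is needed. The names, the fixed stack depth of 64, and treating `<` and `>` as brackets are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* closer_for[c] is the closing character expected after opener c, else 0. */
static const unsigned char closer_for[256] = {
    ['('] = ')', ['['] = ']', ['{'] = '}', ['<'] = '>',
};

static bool is_closer(unsigned char c) {
    return c == ')' || c == ']' || c == '}' || c == '>';
}

/* Returns true if every bracket in s is closed by its matching partner. */
static bool brackets_balanced(const char *s) {
    unsigned char stack[64];   /* expected closers, innermost last */
    size_t depth = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (closer_for[c]) {
            if (depth == sizeof stack) return false;  /* nested too deep */
            stack[depth++] = closer_for[c];
        } else if (is_closer(c)) {
            if (depth == 0 || stack[--depth] != c) return false;
        }
    }
    return depth == 0;
}
```

A table load costs about the same whatever the character codes happen to be, which is why the ordering of the bracket characters makes little difference to this style of parser.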
I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.
Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.