| HN Mirror


by tialaramex 442 days ago

> But is there any long-lived project for which this isn't true?

I don't think any other project like this exists. You're coming up on your 25th anniversary without shipping the release software!

I see that BString itself also uses this weird phrase "UTF-8 character". That's not a thing, and rather than just being technically wrong it's so weird I can't tell what the people who made it thought they meant or what the practical consequences might be.

I mean, it can't be worse than std::string in one sense because hey at least it picked... something. But if I can't figure out what that is maybe it's not better.

UTF-8 has code units, but they're one byte, so distinguishing them from bytes means either you're being weird about what a "byte" is or more likely you don't mean code units.

Unicode has characters, but, well, let's quote their glossary: "(1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]"

So given BString is software it's probably working in terms of something concrete. My best guesses (plural, like I said, I'm not sure and I'm not even sure the author realised they needed to decide):

1. UTF-16 code units. This is the natural evolution of software intended for UCS-2 in a world where that's not a thing, our world.

2. Unicode code points. If you were stubbornly determined to keep doing the same thing despite the fact UCS-2 didn't happen, you might get here, which is tragic.

3. Unicode scalar values. Arguably useful, although in an intensely abstract way, the closest thing a bare metal language might attempt as a "character".

4. Graphemes. Humans think these are a reasonable way to cut up written language, which is a shame because machines can't necessarily figure out what is or is not a grapheme. But maybe the software tries to do this? There have been better and worse attempts.
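To make the difference between these candidates concrete, here is a minimal Python sketch that counts one short string in several of these units. The helper name `unit_counts` and the sample string are illustrative assumptions and have nothing to do with BString's actual API. Grapheme counting is left out because the Python standard library does not implement Unicode text segmentation.

```python
def unit_counts(text: str) -> dict:
    """Count the same string in several units that "character" might mean.

    Assumes `text` contains no lone surrogates; encoding would raise otherwise.
    """
    return {
        # UTF-8 code units are single bytes.
        "utf8_code_units": len(text.encode("utf-8")),
        # UTF-16 code units are two bytes each; "-le" avoids a prepended BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # A Python str is a sequence of code points.
        "code_points": len(text),
        # Scalar values are code points outside the surrogate range.
        "scalar_values": sum(1 for c in text if not 0xD800 <= ord(c) <= 0xDFFF),
    }


# 'e' + U+0301 COMBINING ACUTE ACCENT + U+1F600 GRINNING FACE
sample = "e\u0301\U0001F600"
print(unit_counts(sample))
```

For this sample the four counts are 7, 4, 3 and 3, while a reader would most likely see two "characters". None of the machine-level units matches that.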

I don't love std::vector, but I can't see anything to recommend BList at all: it's all using type-erased pointers, it doesn't have the correct reservation API, it provides its own weird sorting - which doesn't even say whether it's a stable sort,

2 comments

waddlesplash 442 days ago

It's Unicode code points. I don't know why you say this is "tragic", it's a logical unit to work in here.
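A small Python sketch of what working in code points means in practice, assuming a hypothetical helper `first_code_points` (not a BString method): cutting at a code point boundary never produces invalid UTF-8, but it can still separate a base letter from its combining mark unless the text was normalized first.

```python
import unicodedata


def first_code_points(text: str, n: int) -> str:
    """Return the first n code points of text (Python str slicing counts code points)."""
    return text[:n]


decomposed = "cafe\u0301"  # "café" written as 'e' + U+0301 COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)  # "caf\u00e9"

# Same visible text, different code point counts.
print(len(decomposed), len(precomposed))

# Taking four code points drops the accent in the decomposed form only.
print(first_code_points(decomposed, 4) == "cafe")
print(first_code_points(precomposed, 4) == "caf\u00e9")
```

Whether that behaviour is acceptable depends on what callers expect "character" to mean.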


GoblinSlayer 442 days ago

I suppose it means text encoding is known to be UTF-8.


GoblinSlayer 441 days ago

Edit: oww, the *Chars* method family? Well, that one is bad. STL is sort of lucky here as it tried to figure out Unicode when it was already well known.
