HN Mirror



	by jhasse 152 days ago
	That's where the standard should come in and say something like "starting with C++26, char is always 1 byte and signed; std::string is always UTF-8." Done, Unicode in C++ fixed. But instead we get this mess. I guess it's because there's too much Microsoft in the standard, and they are the only ones who don't have UTF-8 everywhere (in Windows) yet.

2 comments

fluoridation 152 days ago

char is always 1 byte. What it's not always is 1 octet.
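The byte/octet distinction can be made concrete with `CHAR_BIT`. The following is a minimal sketch; `byte_is_octet` and `bits_in` are hypothetical helper names introduced here for illustration, not anything from the standard library.

```cpp
#include <climits>
#include <cstddef>

// sizeof(char) is 1 by definition: in C++, "byte" means "the size of char".
static_assert(sizeof(char) == 1, "char is always exactly one byte");

// The standard only guarantees that a byte has at least 8 bits.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: true when this platform's byte is an octet.
constexpr bool byte_is_octet() { return CHAR_BIT == 8; }

// Hypothetical helper: number of bits in an object of type T.
template <typename T>
constexpr std::size_t bits_in() { return sizeof(T) * CHAR_BIT; }
```

On mainstream desktop and server platforms `byte_is_octet()` is true; the distinction mostly matters on some DSPs and historical machines where `CHAR_BIT` is larger than 8.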


jhasse 151 days ago

You're right. What I meant was that it should always be 8 bits, too.


jstimpfle 152 days ago

std::string is not UTF-8 and can't be made UTF-8. It's encoding-agnostic; its API is in terms of bytes, not codepoints.


jhasse 151 days ago

Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.

But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% cases where I need codepoints) and it works fine. But starting with C++20 it's starting to get annoying because I have to reinterpret_cast to the useless u8 versions.
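For readers unfamiliar with the C++20 change mentioned here: `u8"..."` literals became arrays of `char8_t` instead of `char`, so they no longer convert to `std::string` directly. The following is a rough sketch of the kind of cast involved; `u8_view`, `to_std_string`, and `usage_example` are hypothetical names, and the `#if` is only there so the sketch also compiles in pre-C++20 mode.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// In C++20, u8"..." literals are arrays of char8_t; before C++20 they were
// arrays of char. Pick the matching view type for either mode.
#if defined(__cpp_char8_t)
using u8_view = std::u8string_view;
#else
using u8_view = std::string_view;
#endif

// Hypothetical helper: copy UTF-8 code units into a plain std::string.
// Reading any object through a const char* is permitted, so the cast is
// well-defined; it is just noisy to write at every call site.
inline std::string to_std_string(u8_view s) {
    return std::string(reinterpret_cast<const char*>(s.data()), s.size());
}

// Usage sketch: the bytes are identical, only the static type differs.
inline void usage_example() {
    const std::string s = to_std_string(u8"h\u00e9llo");
    assert(s == "h\xC3\xA9llo");  // 'é' is the two bytes C3 A9 in UTF-8
}
```

Wrapping the cast in one helper at least keeps it out of the call sites, which is roughly the workaround the comment describes.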


jstimpfle 150 days ago

First, because of existing constraints like mutability through direct buffer access, a hypothetical codepoints_size() would require recomputation on each call, which would be prohibitively expensive, in particular because std::string is virtually unbounded in size.

Second, there is also no way to guarantee that a string encodes valid UTF-8; it could just be whatever.

You can still just use std::string to store valid encoded UTF-8, you just have to be a little bit careful. And functions like codepoints_size() are pretty fringe -- unless you're doing specialized Unicode transformations, it's more typical to just treat strings as opaque byte slices in a typical C++ application.
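Both points can be illustrated with free functions over a byte view. This is a minimal sketch, not a proposal for the actual std::string API: `codepoints_size` here is a hypothetical free function borrowing the name from the thread, and `is_valid_utf8` is likewise hypothetical. Note that the count is a full O(n) scan on every call, and that it is only meaningful if the validity check passes.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: count code points by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is already valid UTF-8.
// Nothing is cached, so every call rescans the whole string.
constexpr std::size_t codepoints_size(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: check that the bytes form well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates, and values above U+10FFFF.
constexpr bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;  // expected length of this sequence in bytes
        char32_t cp = 0;      // decoded code point
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}
```

Because both functions take a `std::string_view`, they work on a std::string without changing the class itself, which is compatible with treating the string as an opaque byte container everywhere else.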


jhasse 150 days ago

Perfect is the enemy of good. Or do you think the current mess is better?


jstimpfle 148 days ago

std::string _cannot_ be made "always UTF-8". Is that really so contentious?

You can still use it to contain UTF-8 data. It is commonly done.


jhasse 148 days ago

I never said always. Just add some new methods for which it has to be UTF-8. All current functions that need an encoding (e.g. text IO) also switch to UTF-8. Of course you could still save arbitrary binary data in it.
