	by ChrisSD 145 days ago
	`char8_t` is probably one of the more baffling blunders of the standards committee.

jjmarr 145 days ago

there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.

If your codebase has those guarantees, go ahead and use it.

hackyhacky 144 days ago

> there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.

True, but sizeof(char) is defined to be 1. In section 7.6.2.5 [1]:

"The result of sizeof applied to any of the narrow character types is 1"

In fact, char and associated types are the only types in the standard where the size is not implementation-defined.

So the only way that a C++ implementation can conform to the standard and have a char type that is not 8 bits is if the size of a byte is not 8 bits. There are historical systems that meet that constraint but no modern systems that I am aware of.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/n49...
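
A minimal sketch of the distinction above, assuming any C++11-or-later compiler: `sizeof` counts bytes, while `CHAR_BIT` from `<climits>` says how many bits one byte has. The helper name `bits_in` is made up for illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// sizeof counts bytes, and a byte is by definition the size of char.
static_assert(sizeof(char) == 1, "true on every conforming implementation");
static_assert(sizeof(unsigned char) == 1, "same for the other narrow character types");

// The standard only requires a byte to have at least 8 bits.
static_assert(CHAR_BIT >= 8, "guaranteed minimum");

// Hypothetical helper: size of a type's object representation in bits.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}
```

On mainstream desktop and server targets `bits_in<char>()` is 8; on a word-addressed DSP it could be 16 or 32 while `sizeof(char)` stays 1.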

int_19h 143 days ago

That would be any CPU with word-addressing only. Which, granted, is very exotic today, but they do still exist: https://www.analog.com/en/products/adsp1802.html

gpderetta 144 days ago

Don't some modern DSPs still have 32 bits as the minimum addressable unit of memory? Or is it a thing of the past?

AnimalMuppet 144 days ago

If you're on such a system, and you write code that uses char, then perhaps you deserve whatever mess that causes you.

20k 145 days ago

char8_t also isn't guaranteed to be 8 bits, because sizeof(char) == 1 and sizeof(char8_t) >= 1. On a platform where char is 16 bits, char8_t will be 16 bits as well.

The C++ standard explicitly says that it has the same size, signedness, and alignment as unsigned char, but it's a distinct type. So it's pretty useless, and badly named.
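
These properties can be checked at compile time. The sketch below assumes a compiler that predefines the `__cpp_char8_t` feature-test macro when `char8_t` is available (typically C++20 mode); `char8_width_bits` is a hypothetical helper, not a standard facility.

```cpp
#include <climits>
#include <type_traits>

#if defined(__cpp_char8_t)  // set by the compiler when char8_t is provided
// Distinct type: overloads and templates can tell it apart.
static_assert(!std::is_same<char8_t, unsigned char>::value, "distinct from unsigned char");
static_assert(!std::is_same<char8_t, char>::value, "distinct from char too");

// ...but with the same size, signedness and alignment as unsigned char.
static_assert(sizeof(char8_t) == sizeof(unsigned char), "same size, i.e. one byte");
static_assert(alignof(char8_t) == alignof(unsigned char), "same alignment");
static_assert(std::is_unsigned<char8_t>::value, "unsigned");
#endif

// Hypothetical helper: width of char8_t in bits, or 0 if the type is unavailable.
constexpr int char8_width_bits() {
#if defined(__cpp_char8_t)
    return static_cast<int>(sizeof(char8_t)) * CHAR_BIT;
#else
    return 0;
#endif
}
```

Where the type exists, `char8_width_bits()` equals `CHAR_BIT`, so it is 8 only on platforms whose bytes are octets.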

1718627440 144 days ago

Wouldn't it rather be the case that char8_t just wouldn't exist on that platform? At least that's the case with the uintN_t types; they are just not available everywhere. If you want something that is always available you need to use uint_leastN_t or uint_fastN_t.
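
For the uintN_t half of that comparison, here is a sketch of how code can detect whether the exact-width type exists. It relies on `<cstdint>` defining `UINT8_MAX` only when `uint8_t` itself is provided; the names `octet_t` and `to_octet` are hypothetical.

```cpp
#include <cstdint>

// UINT8_MAX is defined only when uint8_t exists,
// so the macro doubles as a presence check.
#if defined(UINT8_MAX)
using octet_t = std::uint8_t;        // exactly 8 bits
#else
using octet_t = std::uint_least8_t;  // smallest type with at least 8 bits
#endif

// Hypothetical helper: keep only the low 8 bits, whatever width octet_t has.
constexpr octet_t to_octet(unsigned value) {
    return static_cast<octet_t>(value & 0xFFu);
}
```

The explicit mask matters only on the fallback path, where the storage type may be wider than 8 bits.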

wtf

It is pretty consistent. It is part of the C Standard and a feature meant to make string handling better, it would be crazy if it wasn't a complete clusterfuck.

Maxatar 145 days ago

There's no guarantee char8_t is 8 bits either, it's only guaranteed to be at least 8 bits.

hackyhacky 144 days ago

> There's no guarantee char8_t is 8 bits either, it's only guaranteed to be at least 8 bits.

Have you read the standard? It says: "The result of sizeof applied to any of the narrow character types is 1." Here, "narrow character types" covers char, signed char, unsigned char, and char8_t. So technically they aren't guaranteed to be 8 bits, but they are guaranteed to be one byte.

adrian_b 144 days ago

Yes, but the byte is not guaranteed to be 8 bits, because on many ancient computers it wasn't.

The poster to whom you replied has read the standard correctly.

CyberDildonics 144 days ago

What platforms have char8_t as more than 8 bits?

marcthe12 144 days ago

Well, platforms with CHAR_BIT != 8. In C and C++, char, and therefore a byte, is at least 8 bits, not exactly 8 bits. POSIX does force CHAR_BIT == 8. I think the only place is in embedded, and there only some DSPs or ASIC-like devices. So in practice most code will break on those platforms, and they are very rare. But they are still technically supported by the C and C++ standards. Similar to how C still supported non-two's-complement architectures until 2023.
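
Code that does assume octets can at least say so and fail at compile time elsewhere. A minimal sketch; `read_u16_le` is a hypothetical helper standing in for any code that depends on 8-bit bytes.

```cpp
#include <climits>

// Refuse to build where a byte is not an octet,
// instead of silently misbehaving there.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper that depends on the assumption: assemble a
// little-endian 16-bit value from two octets.
constexpr unsigned read_u16_le(unsigned char lo, unsigned char hi) {
    return static_cast<unsigned>(lo) | (static_cast<unsigned>(hi) << 8);
}
```

On a POSIX system the assertion can never fire; on a 16-bit-char DSP the build stops with a readable message.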

jhasse 144 days ago

That's where the standard should come in and say something like "starting with C++26 char is always 1 byte and signed. std::string is always UTF-8". Done, fixed Unicode in C++.

But instead we get this mess. I guess it's because there's too much Microsoft in the standard and they are the only ones not having UTF-8 everywhere in Windows yet.

fluoridation 144 days ago

char is always 1 byte. What it's not always is 1 octet.

jhasse 144 days ago

You're right. What I meant was that it should always be 8 bits, too.

jstimpfle 144 days ago

std::string is not UTF-8 and can't be made UTF-8. It's encoding agnostic, its API is in terms of bytes not codepoints.

jhasse 144 days ago

Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.

But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% cases where I need codepoints) and it works fine. But starting with C++20 it's starting to get annoying because I have to reinterpret_cast to the useless u8 versions.
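
A sketch of the friction described here, assuming the compiler predefines `__cpp_char8_t` when `char8_t` is enabled; `from_u8` is a hypothetical helper, not a standard function. Before C++20 a `u8""` literal is an array of `const char`; from C++20 on it is an array of `const char8_t`, which no longer converts to `std::string` without a cast.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copy a u8"" literal into a plain std::string.
#if defined(__cpp_char8_t)
template <std::size_t N>
std::string from_u8(const char8_t (&lit)[N]) {
    // Reading char8_t storage through const char* is permitted aliasing.
    return std::string(reinterpret_cast<const char*>(lit), N - 1);  // drop the '\0'
}
#else
template <std::size_t N>
std::string from_u8(const char (&lit)[N]) {
    return std::string(lit, N - 1);  // no cast needed before C++20
}
#endif
```

With either branch, `from_u8(u8"\u00e9")` yields the two bytes 0xC3 0xA9 in an ordinary `std::string`.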

jstimpfle 143 days ago

First, because of existing constraints like mutability through direct buffer access, a hypothetical codepoints_size() would require recomputation each time, which would be prohibitively expensive, in particular because std::string is virtually unbounded.

Second, there is also no way to be able to guarantee that a string encodes valid UTF-8, it could just be whatever.

You can still just use std::string to store valid encoded UTF-8, you just have to be a little bit careful. And functions like codepoints_size() are pretty fringe -- unless you're doing specialized Unicode transformations, it's more typical to just treat strings as opaque byte slices in a typical C++ application.
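
To make the cost argument concrete, here is a sketch of what such a helper would have to do as a free function. It assumes the string already holds valid UTF-8 and does no validation; the name `count_code_points` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string assumed to be valid UTF-8.
// Each code point has exactly one byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {   // unsigned char avoids sign surprises
        if ((byte & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;  // O(n): the whole buffer is scanned on every call
}
```

Because the buffer can be modified through `data()` or `operator[]` at any time, the result cannot be cached inside the string, which is the recomputation problem raised above.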

jhasse 143 days ago

Perfect is the enemy of good. Or do you think the current mess is better?

dataflow 145 days ago

How many non-8-bit-char platforms are there with char8_t support, and how many do we expect in the future?

RobotToaster 145 days ago

Mostly DSPs

LexiMax 144 days ago

Is there a single esoteric DSP in active use that supports C++20? This is the umpteenth time I've seen DSPs brought up in casual conversations about C/C++ standards, so I did a little digging:

Texas Instruments' compiler seems to be celebrating C++14 support: https://www.ti.com/tool/C6000-CGT

CrossCore Embedded Studio apparently supports C++11 if you pass a switch in requesting it, though this FAQ answer suggests the underlying standard library is still C++03: https://ez.analog.com/dsp/software-and-development-tools/cce...

Everything I've found CodeWarrior related suggests that it is C++03-only: https://community.nxp.com/pwmxy87654/attachments/pwmxy87654/...

Aside from that, from what I can tell, those esoteric architectures are being phased out in favor of running DSP workloads on Cortex-M, which is just ARM.

I'd love it if someone who was more familiar with DSP workloads would chime in, but it really does seem that trying to be the language for all possible and potential architectures might not be the right play for C++ in 202x.

Besides, it's not like those old standards or compilers are going anywhere.

dspwizard 144 days ago

Cadence DSPs have a C++17-compatible compiler and will get C++20 soon, as will the new CEVA cores (both are Clang-based). TI C7x is still C++14 (C6000 is an ancient core, yet it still got C++14 support, as you mentioned). AFAIR the Cadence ASIP generator will give you a C++17 toolchain and C++20 is on the roadmap, but I'm not 100% sure.

But for those devices you use a limited subset of language features, and you would be better off not linking the C++ stdlib or even the C stdlib at all (so junior developers don't have space for doing stupid things ;))

pkasting 144 days ago

Green Hills Software's compiler supports more recent versions of C++ (it uses the EDG frontend) and targets some DSPs.

Back when I worked in the embedded space, chips like ZSP were around that used 16-bit bytes. I am twenty years out of date on that space though.

LexiMax 144 days ago

How common is it to use Green Hills compilers for those DSP targets? I was under the impression that their bread was buttered by more-familiar-looking embedded targets, and more recently ARM Cortex.

BoredomIsFun 144 days ago

> but it really does seem that trying to be the language for all possible and potential architectures might not be the right play for C++ in 202x.

Portability was always a selling point of C++. I'd personally advise those who find it uncomfortable to choose a different PL, perhaps Rust.

LexiMax 144 days ago

> Portability was always a selling point of C++.

Judging by the lack of modern C++ in these crufty embedded compilers, maybe modern C++ is throwing too much good effort after bad. C++03 isn't going away, and it's not like these compilers always stuck to the standard anyway in terms of runtime type information, exceptions, and full template support.

Besides, I would argue that the selling point of C++ wasn't portability per se, but the fact that it was largely compatible with existing C codebases. It was embrace, extend, extinguish in language form.

dataflow 145 days ago

Non-8-bit-char DSPs would have char8_t support? Definitely not something I expected, links would be cool.

j16sdiz 144 days ago

Why not? Except that it is the same as `unsigned char` and can be larger than 8 bits.

ISO/IEC 9899:2024 section 7.30

> char8_t which is an unsigned integer type used for 8-bit characters and is the same type as unsigned char;

dataflow 144 days ago

> Why not?

Because "it supports Unicode" is not an expected use case for a non-8-bit DSP?

Do you have a link to a single one that does support it?

kevin_thibedeau 143 days ago

The exact size types are never present on platforms that don't support them.

dspwizard 144 days ago

TI C2000 is one example

dataflow 144 days ago

Thank you. I assume you're correct, though for some reason I can't find references claiming C++20 being supported with some cursory searches.

Asmod4n 144 days ago

char on Linux ARM is unsigned, which makes for fun surprises when you have only ever dealt with x86 and assumed char to be signed everywhere.
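
A sketch of the kind of surprise meant here, and one way to write code that behaves the same under either signedness. `char_is_signed` and `is_non_ascii` are hypothetical names.

```cpp
#include <type_traits>

// Implementation-defined: typically true on x86 Linux, false on ARM Linux.
constexpr bool char_is_signed = std::is_signed<char>::value;

// Hypothetical helper: is this byte outside the ASCII range?
// A bare `c > 127` can never be true where char is signed and 8 bits wide,
// so convert to unsigned char before comparing.
constexpr bool is_non_ascii(char c) {
    return static_cast<unsigned char>(c) > 127;
}
```

The same conversion is needed before passing a `char` to the `<cctype>` functions or using it as an array index.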

pkasting 144 days ago

This bit us in Chromium. We at least discussed forcing the compiler to use unsigned char on all platforms; I don't recall if that actually happened.

MaskRay 144 days ago

I recall that google3 switched to -funsigned-char for x86-64 a long time ago.

pkasting 144 days ago

A cursory Chromium code search does not find anything outside third_party/ forcing either signed or unsigned char.

I suspect if I dug into the archives, I'd find a discussion on cxx@ with some comments about how doing this would result in some esoteric risk. If I was still on the Chrome team I'd go looking and see if it made sense to reraise the issue now; I know we had at least one stable branch security bug this caused.

kps 144 days ago

Related: in C at least (C++ standards are tl;dr), type names like `int32_t` are not required to exist. Most uses, in portable code, should be `int_least32_t`, which is required.
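
A sketch of what using the "least" types looks like in practice, written in C++ via `<cstdint>`, which mirrors the C header. The helper `add_mod32` is hypothetical. Since the type may be wider than 32 bits, arithmetic that needs 32-bit wrap-around has to mask explicitly.

```cpp
#include <cstdint>

// uint_least32_t must exist; uint32_t need not. Where the "least" type is
// wider than 32 bits, mask explicitly to get 32-bit wrap-around.
constexpr std::uint_least32_t add_mod32(std::uint_least32_t a, std::uint_least32_t b) {
    return (a + b) & UINT32_C(0xFFFFFFFF);
}
```

Where `uint32_t` does exist the mask is a no-op that compilers can be expected to optimize away.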
