One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be
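A minimal sketch of such a layout, assuming the "word" is a size_t and that the caller is handed a pointer to the first character rather than to the length. The names lp_new, lp_len and lp_free are made up for illustration and do not come from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocates [size_t length][characters]['\0'] and
   returns a pointer to the first character, not to the length word. */
static char *lp_new(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL)
        return NULL;
    *block = len;                     /* length word sits just before the text */
    char *text = (char *)(block + 1);
    memcpy(text, src, len + 1);       /* copy the terminator as well */
    return text;
}

/* O(1) length: read the word stored immediately before the text. */
static size_t lp_len(const char *text)
{
    assert(text != NULL);
    size_t len;
    memcpy(&len, text - sizeof len, sizeof len);
    return len;
}

/* Frees the whole block, starting at the hidden length word. */
static void lp_free(char *text)
{
    if (text != NULL)
        free(text - sizeof(size_t));
}
```

Because the terminator is still there, the returned pointer can be passed to ordinary C string functions; only code that knows about the hidden word gets the constant-time length.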
"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"
This means, for example, that strlen() must always scan for the location of the first null character; a stored length gives it no advantage.
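To illustrate, here is a small sketch in which a stored length and strlen() disagree. The struct and function names are hypothetical; the point is only that strlen() is defined by the first null character and never consults any length kept alongside the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer that carries its own length next to the bytes. */
struct counted_buf {
    size_t len;        /* bytes the owner says are in use */
    char   data[16];
};

/* The C notion of "string length": scan for the first '\0'.
   b->len is never looked at. */
static size_t c_string_length(const struct counted_buf *b)
{
    assert(b != NULL);
    return strlen(b->data);
}

/* Eleven bytes of payload with a null character embedded after "hello". */
static struct counted_buf make_embedded_nul(void)
{
    struct counted_buf b = { 11, "hello\0world" };
    return b;
}
```

With this buffer the stored length is 11, but the string, as the standard defines it, ends after "hello".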
Sure. For instance, there are times when you need to pack strings tightly together. Adding an extra byte or two before the start of the string would get in the way. You could work around it in many cases, but it makes the code uglier and harder to understand/maintain.
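One example of tight packing, sketched under the assumption of a table of null-terminated strings stored back to back and ended by an empty string (the same general shape as a string table in an object file). The table contents and the packed_get name are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Three strings packed back to back. The literal's implicit terminator
   after the last explicit '\0' acts as the empty string ending the table. */
static const char packed[] = "foo\0barbaz\0q\0";

/* Returns the idx-th string in a packed table, or NULL if there are fewer. */
static const char *packed_get(const char *table, size_t idx)
{
    assert(table != NULL);
    while (*table != '\0') {
        if (idx == 0)
            return table;
        table += strlen(table) + 1;   /* skip the string and its terminator */
        idx--;
    }
    return NULL;
}
```

Every byte in the table is either payload or a terminator, so a pointer into it is already a valid C string; a per-string length header would have to be skipped over on every access.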
One of the things that makes C particularly suitable for certain sorts of tasks is that it's mostly WYSIWYG when it comes to the relationship between data structures and the actual memory layout. Having "hidden" things like a length value before the string steps on that.
If you wanted to pack strings together tightly, couldn't your string library have a separate "array" concept where all the sizes are stored separately?
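A rough sketch of what that could look like, assuming the bytes are stored contiguously with no terminators and the sizes are kept as a separate offset table. The struct str_array type and its accessors are hypothetical, not an existing library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packed string array: the bytes carry no headers or
   terminators; all size information lives in a separate table. */
struct str_array {
    const char   *bytes;     /* all string bytes, back to back */
    const size_t *offsets;   /* count + 1 entries; entry i is where string i starts */
    size_t        count;
};

/* Length of string i is the distance to the start of the next one. */
static size_t str_array_len(const struct str_array *a, size_t i)
{
    assert(a != NULL && i < a->count);
    return a->offsets[i + 1] - a->offsets[i];
}

/* Pointer to the bytes of string i. Not null-terminated, so it must be
   used together with str_array_len(). */
static const char *str_array_data(const struct str_array *a, size_t i)
{
    assert(a != NULL && i < a->count);
    return a->bytes + a->offsets[i];
}

/* "hello", "world" and "!" packed into eleven bytes of payload. */
static const char   demo_bytes[]   = "helloworld!";
static const size_t demo_offsets[] = { 0, 5, 10, 11 };
static const struct str_array demo = { demo_bytes, demo_offsets, 3 };
```

The trade-off is that the elements are no longer C strings: they cannot be handed to strlen() or printed with "%s", only with length-aware calls such as memcmp() or a "%.*s" format.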
If only. In C, it’s a (95+5)-item character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:
“The basic literal character set consists of all characters of the basic character set, plus the following control characters”
That page also explicitly says:
“The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.
Code unit  Character      Glyph
U+0024     Dollar Sign    $
U+0040     Commercial At  @
U+0060     Grave Accent   `”
If I read that correctly, writing a ‘$’ in a string literal before C23 gives no guarantee that you get a byte with value 0x24.
Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).
Also (https://en.cppreference.com/w/cpp/language/charset):
“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) / translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”
If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.
Also, chances are this changed in subtle ways between C and C++ versions.
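If code does depend on these encodings, one way to avoid relying on the reading above is to check or spell them out explicitly. The sketch below is in C and assumes a C11 or later compiler; euro_utf8 is a made-up name. It makes the build fail if '$' is not 0x24, and writes the euro sign as a universal character name in a u8 literal so that the source file's encoding does not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fails the build on an implementation whose literal encoding does not
   put '$' at 0x24 (for example an EBCDIC target). */
static_assert('$' == 0x24, "'$' is not encoded as 0x24 on this implementation");

/* U+20AC spelled as a universal character name inside a u8 literal, so the
   bytes are UTF-8 regardless of how the source file itself is encoded.
   The cast covers C23, where u8 literals have element type char8_t. */
static const char *euro_utf8(void)
{
    return (const char *)u8"\u20AC";
}
```

On an implementation that passes the static assertion, the string returned by euro_utf8() is the three-byte UTF-8 sequence E2 82 AC followed by the terminator.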