by wolf550e 3214 days ago
C does not own the word "string". A string is a piece of text. It is not a byte array.

Unicode strings are arrays of code points, which are 21-bit numbers.

If the API requires fast subscripting (it usually does), then the internal representation would be UTF-32 or three code points packed into an int64; otherwise a more compact internal representation is possible.
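
To make the three-code-points-in-int64 idea concrete, here is a rough Python sketch. The names pack and code_point_at are made up for illustration, and the bit layout (slot 0 in the low 21 bits) is just one possible choice. Since 3 x 21 = 63 bits, each group fits in a signed 64-bit word and subscripting stays O(1).

```python
BITS = 21
MASK = (1 << BITS) - 1  # 0x1FFFFF, enough for any code point up to U+10FFFF


def pack(text):
    """Pack code points three per word; return (words, number of code points)."""
    cps = [ord(c) for c in text]
    words = []
    for i in range(0, len(cps), 3):
        word = 0
        for slot, cp in enumerate(cps[i:i + 3]):
            word |= cp << (BITS * slot)  # slot 0 goes in the low bits
        words.append(word)
    return words, len(cps)


def code_point_at(words, length, index):
    """Constant-time subscript: pick the word, then shift and mask the slot."""
    if not 0 <= index < length:
        raise IndexError(index)
    word, slot = divmod(index, 3)
    return (words[word] >> (BITS * slot)) & MASK
```

Usage: words, n = pack("a€😀b") gives two words for four code points, and chr(code_point_at(words, n, 2)) gives back "😀".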

If you don't need subscripting and allow only iteration over the code points, then the in-memory representation of strings can be more compact. It can use UTF-8 or even SCSU or BOCU-1.
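
As a sketch of the iteration-only case, this is roughly what walking code points over a UTF-8 buffer looks like. iter_code_points is a hypothetical name, and the code assumes the input is already well-formed UTF-8: it does not check continuation bytes, overlong forms, or surrogates.

```python
def iter_code_points(data):
    """Yield code points from a bytes object holding well-formed UTF-8."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # 0xxxxxxx: one byte (ASCII)
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:       # 110xxxxx: two bytes
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:      # 1110xxxx: three bytes
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:     # 11110xxx: four bytes
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += 1 + extra
```

Iteration is cheap, but finding the n-th code point means scanning from the start, which is the trade-off against the fixed-width layouts.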

Some languages use polymorphic Unicode strings, which store ASCII if the value is all-ASCII and switch to something else if it isn't (Python 3.3 and Factor come to mind).
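
You can observe the polymorphic storage from Python itself. This is a small sketch that relies on a CPython implementation detail (the flexible string representation introduced in 3.3), so other interpreters may report different numbers; bytes_per_char is a made-up helper name.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by comparing two string lengths.

    Subtracting the sizes cancels the fixed object header, leaving only
    the per-character cost of whatever width the interpreter picked.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n
```

On CPython 3.3 or later I would expect bytes_per_char("a") to be 1, bytes_per_char("€") to be 2 and bytes_per_char("😀") to be 4, since the widest code point in the string decides the width for the whole string.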