by sw00pur
3214 days ago
>No such thing! Strings are an array of integer unicode code points. I'd argue that, generally, strings are simply arrays of chars, which are bytes. The failure here was keeping the name "string" for what are arrays of codepoints instead of bytes.

Unicode strings are arrays of code points, which are 21-bit numbers.
If the API requires fast subscripting (it usually does), the internal representation would be UTF-32 or three code points packed into an int64 (3 × 21 = 63 bits); otherwise a more compact internal representation is possible.
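
To make the three-per-int64 idea concrete, here is a rough sketch. The names `pack` and `code_point_at` are made up for illustration, and a real implementation would use a fixed-width 64-bit array rather than a list of Python ints:

```python
BITS = 21                 # U+10FFFF, the largest code point, fits in 21 bits
MASK = (1 << BITS) - 1    # 0x1FFFFF

def pack(text):
    """Pack a string into 63-bit words, three code points per word (hypothetical helper)."""
    cps = [ord(c) for c in text]
    words = []
    for i in range(0, len(cps), 3):
        word = 0
        for slot, cp in enumerate(cps[i:i + 3]):
            word |= cp << (BITS * slot)   # slot 0 sits in the low bits
        words.append(word)
    return words, len(cps)

def code_point_at(words, index):
    """Constant-time subscript: pick the word, then shift and mask out the slot."""
    word, slot = divmod(index, 3)
    return (words[word] >> (BITS * slot)) & MASK
```

Subscripting stays O(1) like UTF-32, but each code point costs 64/3 ≈ 21.3 bits instead of 32.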
If you don't need to support subscripting and allow only iteration over the code points, the in-memory representation can be more compact: UTF-8, or even SCSU or BOCU-1.
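
For the iteration-only case, a minimal sketch of walking code points over UTF-8 bytes. `iter_code_points` is a hypothetical name, and the sketch assumes well-formed input (it does not validate continuation bytes or reject overlong forms):

```python
def iter_code_points(data: bytes):
    """Yield code points from UTF-8 bytes, front to back (assumes valid UTF-8)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: one byte, plain ASCII
            length, cp = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected byte {lead:#x} at offset {i}")
        for cont in data[i + 1:i + length]:
            cp = (cp << 6) | (cont & 0x3F)   # each continuation byte adds 6 bits
        i += length
        yield cp
```

Because sequence lengths vary, finding the n-th code point means scanning from the start, which is exactly the trade-off: compact storage, no fast subscript.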
Some languages use polymorphic Unicode strings, which store ASCII if the value is all ASCII and switch to a wider representation if it isn't (Python 3.3 and Factor come to mind).
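
A toy version of the polymorphic approach, loosely modeled on the 1/2/4-byte widths that PEP 393 describes for Python 3.3. `storage_width` and `FlexString` are invented names, not anything from CPython or Factor:

```python
def storage_width(text):
    """Bytes per code point needed to hold the widest code point in text."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4

class FlexString:
    """Fixed-width storage whose width is chosen per string (illustrative only)."""

    def __init__(self, text):
        self.width = storage_width(text)
        self.data = b"".join(ord(c).to_bytes(self.width, "little") for c in text)

    def __len__(self):
        return len(self.data) // self.width

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.width
        return chr(int.from_bytes(self.data[start:start + self.width], "little"))
```

Subscripting stays O(1) because the width is uniform within a string; only strings that actually contain wide code points pay for them.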