by avianlyric
1600 days ago
In languages like C, “string” isn’t a proper data structure. It’s a `char` array, which is itself little more than an array of small integers, i.e. a byte array. These languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports is byte arrays, with some syntactic sugar so you can pretend they’re strings.

Newer languages, like Go and Python 3, that were created in the world of Unicode provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools that make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.

This is not to say that C can’t handle Unicode. It’s just that the language doesn’t provide true primitives to manipulate strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
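To make the distinction concrete, here is a minimal Python 3 sketch of the same text viewed as a string and as a byte array. The sample text and variable names are mine, purely for illustration, and UTF-8 is an assumed encoding.

```python
# Hypothetical sample text, written with escapes so the code points are unambiguous:
# "naïve café" with precomposed ï (U+00EF) and é (U+00E9).
text = "na\u00efve caf\u00e9"

# A str is a sequence of Unicode code points.
code_points = len(text)

# Encoding produces a bytes object, roughly what a C char array holds.
raw = text.encode("utf-8")
byte_count = len(raw)

# Indexing a str gives a one-character str; indexing bytes gives a plain int.
third_char = text[2]
third_byte = raw[2]

print(code_points, byte_count)          # 10 12
print(repr(third_char), hex(third_byte))  # 'ï' 0xc3
```

Once you drop down to `bytes`, index 2 is just the first half of a two-byte sequence, which is the "pretences vanish" part.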
|
All you gain by having Unicode code point strings is the illusion of Unicode support, and that illusion only lasts until you test anything that uses combining characters or variation selectors. In essence, languages opting for such strings are making the same mistake as Windows/Java/etc. did when adopting UTF-16: treating a fixed-size unit as if it were a whole character.
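A rough Python 3 sketch of where code point strings stop helping, using only the standard `unicodedata` module. The sample strings are my own choice, not anything from the thread.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same character, but the code point views disagree.
same_text = composed == decomposed
lengths = (len(composed), len(decomposed))

# Comparing sensibly needs normalization, which comes from a library module.
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed

# Naive reversal by code point detaches the accent from its base letter.
reversed_text = ("caf" + decomposed)[::-1]
starts_with_mark = unicodedata.combining(reversed_text[0]) != 0

# A heart followed by VARIATION SELECTOR-16 is one visible symbol, two code points.
heart = "\u2764\ufe0f"
heart_length = len(heart)

print(same_text, lengths, normalized_equal)  # False (1, 2) True
print(starts_with_mark, heart_length)        # True 2
```

So `len`, `==`, slicing and reversal all operate on code points, not on what a user would call a character. Getting that right still needs normalization and grapheme segmentation from a library.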