Hacker News new | ask | show | jobs
by derefr 3389 days ago
The problem is mostly caused with how programming languages present text strings. There's usually a String class, with methods that say they manipulate characters. Usually, though, they either manipulate bytes, or manipulate code-points, and it's often not clear which.

Which is usually a symptom of a deeper problem: using the same ADT to represent "known-valid decoded text" and "a slice of bytes that would maybe decode to text in some unspecified encoding", such that the methods manipulating that ADT are completely incoherent.

Honestly, I'm surprised we don't see more programming languages like Objective-C, that have very clear distinctions between their "Data" type and "String" type, where encoded text is an NSData (a buffer of bytes) while decoded, valid text is an NSString, and all the methods on NSStrings operate on the grapheme clusters that decoded text is composed of, rather than on the bytes or codepoints or "characters" that are only relevant to encoded text.

1 comments

Swift is especially good at this because almost all string ops are high level. Splitting is on EGCs (inherited from objc no doubt), and equality is normalized equality. You need to explicitly ask for other operations.