by jrabone 4948 days ago
No. Combining characters and NF(K)C/D normalisation rules are a different problem entirely. Consider the "heavy metal umlaut" (i.e. Spın̈al Tap), where no lossless conversion to a single precomposed codepoint is possible: the only representation is "n" followed by U+0308.
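
To make that concrete, here is a minimal Python sketch using the standard `unicodedata` module. The variable names are just for illustration. It contrasts "u" + U+0308, which NFC can compose into U+00FC, with "n" + U+0308, which has no precomposed form and so stays as two codepoints.

```python
import unicodedata

# 'u' + COMBINING DIAERESIS has a precomposed form (U+00FC).
u_umlaut = "u\u0308"
# 'n' + COMBINING DIAERESIS has no precomposed codepoint.
n_umlaut = "n\u0308"

nfc_u = unicodedata.normalize("NFC", u_umlaut)
nfc_n = unicodedata.normalize("NFC", n_umlaut)

print(len(nfc_u))  # 1: composed into a single codepoint
print(len(nfc_n))  # 2: normalisation cannot compose it
```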

They're facets of the same problem. I shouldn't routinely be dealing with either surrogates or combining marks; unless I have a specific reason, it's only an opportunity to make a mistake that hardly anyone knows how to troubleshoot. "n̈" should be an indivisible string of length one until I need to ask how it would actually be encoded in UTF-16 or whatever.
But that's the point - there is no such character. Given that the Unicode Consortium have added codepoints for every other bloody thing under the sun, I'm amazed that there isn't one for n-diaeresis, but there you are.

Add a small number of people who for artistic reasons decide that they want to make life hard (Rinôçérôse I'm looking at you) and you just have to accept that the length of your string might not equal the number of codepoints contained therein...
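
A rough Python sketch of the three different "lengths" being argued about here: codepoints, UTF-16 code units, and user-perceived characters. The helper names `utf16_units` and `approx_graphemes` are made up for this example, and the grapheme count is only an approximation that skips combining marks; proper segmentation follows the rules in UAX #29, which the standard library does not implement.

```python
import unicodedata

def utf16_units(s: str) -> int:
    # Number of 16-bit code units when encoded as UTF-16 (no BOM).
    return len(s.encode("utf-16-le")) // 2

def approx_graphemes(s: str) -> int:
    # Approximation only: count codepoints whose canonical combining
    # class is 0, i.e. skip combining marks such as U+0308.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

n_umlaut = "n\u0308"     # 'n' + COMBINING DIAERESIS
g_clef = "\U0001D11E"    # outside the BMP, needs a surrogate pair in UTF-16
band = unicodedata.normalize("NFD", "Rin\u00f4\u00e7\u00e9r\u00f4se")

for s in (n_umlaut, g_clef, band):
    # len() on a Python 3 str counts codepoints.
    print(len(s), utf16_units(s), approx_graphemes(s))
```

Under these assumptions the combining sequence reports two codepoints but one perceived character, and the clef reports one codepoint but two UTF-16 code units, so which number counts as "the length" depends on which layer you are asking about.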