by imron
3780 days ago
Both UTF-8 and UTF-16 are variable-width encodings: a single code point can take more than one code unit. If you split a string at an arbitrary offset you risk splitting it in the middle of one of those multi-unit sequences. This will be very common in UTF-8 text that contains non-ASCII characters, and rare in UTF-16 (it only happens with characters outside the BMP, which are encoded as surrogate pairs). Neither is something you want in your code, unless you think it's a good idea to corrupt your users' data.

Edit: It's not too difficult to handle these cases and make sure you only split at valid positions, but you do need to be careful, and there are a number of edge cases you might not think of, or even encounter, unless you have the right sort of data to test with. That leads to lots of faulty implementations. E.g. for years MySQL couldn't handle UTF-8 characters outside the BMP (its utf8 charset stored at most 3 bytes per character).
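To make "only split at valid positions" concrete, here is a minimal Python sketch. The function names safe_split_utf8 and safe_split_utf16le are made up for illustration, and both assume the input is already valid UTF-8 or UTF-16LE; neither validates it. The idea is just to back the cut point up until it no longer lands inside a multi-unit sequence.

```python
def safe_split_utf8(data: bytes, limit: int) -> tuple[bytes, bytes]:
    """Split valid UTF-8 at or before byte offset `limit`, never inside a code point."""
    cut = min(max(limit, 0), len(data))
    # Continuation bytes have the form 0b10xxxxxx. If the byte at the cut is one,
    # the cut is inside a multi-byte sequence, so back up to its lead byte.
    while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut], data[cut:]


def safe_split_utf16le(data: bytes, limit: int) -> tuple[bytes, bytes]:
    """Split valid UTF-16LE at or before byte offset `limit`, never inside a surrogate pair."""
    cut = min(max(limit, 0), len(data))
    cut -= cut % 2  # align to a 16-bit code unit boundary
    if 0 < cut < len(data):
        unit = int.from_bytes(data[cut:cut + 2], "little")
        # A low surrogate (0xDC00-0xDFFF) is the second half of a pair,
        # so cutting in front of it would orphan the high surrogate.
        if 0xDC00 <= unit <= 0xDFFF:
            cut -= 2
    return data[:cut], data[cut:]


if __name__ == "__main__":
    utf8 = "naïve 漢字 😀".encode("utf-8")
    head, tail = safe_split_utf8(utf8, 3)  # offset 3 is inside the two-byte "ï"
    print(head.decode("utf-8"), "|", tail.decode("utf-8"))

    utf16 = "a😀".encode("utf-16-le")
    head, tail = safe_split_utf16le(utf16, 4)  # offset 4 is between the surrogates
    print(head.decode("utf-16-le"), "|", tail.decode("utf-16-le"))
```

A naive utf8[:3] on the same data would end with a dangling lead byte and fail to decode, which is exactly the corruption described above. Note the emoji test case: without data like that you would never hit the UTF-16 branch.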
I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like code points that switch the text direction to right-to-left. I guess it's turtles all the way down.
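For the combining code point problem, a rough Python sketch of the same back-up-the-cut idea, one level up. The name safe_split_combining is hypothetical. It only checks the canonical combining class via the standard unicodedata module, so it is not full grapheme cluster segmentation: things like emoji ZWJ sequences and combining marks with class 0 are not covered.

```python
import unicodedata


def safe_split_combining(text: str, limit: int) -> tuple[str, str]:
    """Split at or before code point index `limit`, keeping combining marks with their base."""
    cut = min(max(limit, 0), len(text))
    # unicodedata.combining() returns the canonical combining class, which is
    # nonzero for marks such as U+0301 COMBINING ACUTE ACCENT.
    while 0 < cut < len(text) and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return text[:cut], text[cut:]


if __name__ == "__main__":
    word = "cafe\u0301s"  # "cafés" written as "e" followed by a combining accent
    print(safe_split_combining(word, 4))  # the cut moves in front of the "e"
    print(word[:4], "|", word[4:])        # naive split strands the accent
```

So even a split that is perfectly valid at the encoding level can still separate an accent from its letter, which is the next turtle down.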