Both UTF-8 and UTF-16 can encode a single code point as a sequence of several code units (bytes in UTF-8, 16-bit units in UTF-16). If you split a string at an arbitrary point you risk splitting it inside one of those sequences.
This will be very common in UTF-8 that contains non-ASCII characters, and very rare with UTF-16 (it only happens with characters outside the BMP, which need a surrogate pair).
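To make the difference concrete, here is a small Rust sketch. The helper name `unit_counts` is made up for illustration; it just wraps the standard `char::len_utf8` and `char::len_utf16` methods to report how many code units one code point needs in each encoding.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units) needed to encode one code point.
/// `unit_counts` is a hypothetical helper name, not a library function.
fn unit_counts(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}
```

For 'a' this gives (1, 1), for 'é' (U+00E9) it gives (2, 1), and for an emoji such as U+1F600 it gives (4, 2). So anything non-ASCII is already multi-unit in UTF-8, while UTF-16 only needs two units once you leave the BMP.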
Neither is something you want in your code, unless you think it's a good idea to corrupt your users' data.
Edit: It's not too difficult to handle these cases and make sure you only split at valid positions, but you do need to be careful, and there are a number of edge cases you might not think through or even encounter unless you have the right sort of data to test with - which leads to lots of faulty implementations. e.g. for years MySQL's utf8 charset couldn't store characters outside the BMP (it stopped at three bytes per character; you needed utf8mb4 for the rest).
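As a rough sketch of "only split at valid positions" in Rust: back up from the requested byte offset until you hit a character boundary, then split there. The names `floor_boundary` and `split_safely` are my own; the only standard library calls assumed are `str::is_char_boundary` and `str::split_at`.

```rust
/// Largest byte index <= `idx` that lies on a UTF-8 character boundary of `s`.
/// Hypothetical helper for illustration.
fn floor_boundary(s: &str, idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    let mut i = idx;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Splits `s` at or just before byte offset `idx`, never inside a code point.
fn split_safely(s: &str, idx: usize) -> (&str, &str) {
    s.split_at(floor_boundary(s, idx))
}
```

With "héllo" (where é takes bytes 1 and 2), asking for a split at byte 2 yields ("h", "éllo") instead of a panic or a corrupted string. Note this only protects code points, not combining sequences.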
My parent was speaking about indexing at the code point level, not at the encoding (byte / code unit) level.
I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like rtl switching code points. I guess it's turtles all the way down.
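A quick Rust illustration of the combining problem, with made-up helper names (`code_point_count`, `take_code_points`): the same visible "é" can be one code point (U+00E9) or two (an "e" followed by the combining acute accent U+0301), so indexing by code point can still cut a user-visible character in half.

```rust
/// Number of Unicode code points (Rust `char`s) in `s`.
/// Hypothetical helper for illustration.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

/// The first `n` code points of `s`, collected into a new String.
fn take_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}
```

`code_point_count("\u{e9}")` is 1 while `code_point_count("e\u{301}")` is 2, and `take_code_points("e\u{301}", 1)` returns a bare "e" with the accent dropped.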
Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).
I didn't question that, but hoped to get some inspiration for sane Unicode handling (which I'm not sure is humanly possible, except by treating it as a black box and making no promises).
Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index into UTF-8 strings (no mention of abstract strings of Unicode code points).
> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing
So it's saying Rust mostly benefits from using utf8, but in doing so, it loses the ability to arbitrarily index a character in a string (in constant time).
If it were an abstract string of Unicode code points there would be no problem, except you'd then be using 32 bits per code point.
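The trade-off can be sketched in Rust like this. Both function names are hypothetical; they only use the standard `str::chars` iterator. Finding the n-th code point in a UTF-8 string means walking from the start, while decoding into a `Vec<char>` buys constant-time indexing at four bytes per code point.

```rust
/// O(n): walks the UTF-8 bytes from the start to reach the n-th code point.
/// Hypothetical helper for illustration.
fn nth_code_point(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Decodes into one 32-bit `char` per code point, so `v[i]` is O(1).
fn to_code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For a four-code-point string made of 'a', 'é', an emoji and 'z', the UTF-8 form is 8 bytes, while the `Vec<char>` holds 16 bytes of character data. For pure ASCII text the decoded form is four times the size.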