|
|
|
|
|
by ubernostrum
3200 days ago
|
|
Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon. Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot). But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have do to from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding. As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII. |
|
I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.
[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...