| HN Mirror

by ubernostrum 3251 days ago

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.

Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).
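To make the compromise concrete, here is a minimal sketch of "normalize to a composed form, then cut at N code points". The function name `truncate_chars` is hypothetical and this is not Django's actual implementation, only an illustration of the idea using the standard `unicodedata` module.

```python
import unicodedata

def truncate_chars(text: str, limit: int) -> str:
    """Truncate to at most `limit` code points after NFC normalization.

    Hypothetical helper for illustration; not Django's real code.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed[:limit]

decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(decomposed))                 # 5 code points
print(truncate_chars(decomposed, 4))   # 'café', using precomposed U+00E9
print(decomposed[:4])                  # 'cafe', the accent has been cut off
```

Normalizing only reduces the risk. A sequence with no precomposed form, such as "q" followed by U+0301, still gets split if the cut lands between the two code points.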

But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding.
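A small sketch of the difference, assuming Python 3, where `str` is indexed by code point and `bytes` by byte. The sample string is made up for illustration.

```python
text = "na\u00efve caf\u00e9"      # "naïve café", written with precomposed characters
encoded = text.encode("utf-8")

print(len(text))      # 10 code points
print(len(encoded))   # 12 bytes: 'ï' and 'é' take two bytes each
print(text[2])        # 'ï'
print(encoded[2:4])   # b'\xc3\xaf', the two bytes that encode 'ï'

# A positional check ("is there a space at position 5?") gives
# different answers depending on which sequence you index into.
print(text[5] == " ")         # True
print(encoded[5:6] == b" ")   # False, byte 5 is b'e'
```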

As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.

2 comments

Animats 3251 days ago

Yes, while "back up one UTF-8 rune" is a well-defined operation, "back up one grapheme" is tough. Forward is easy, though.
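For reference, a minimal sketch of why backing up one rune is well defined: continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so you step back until you hit a byte that is not one. The function name `prev_rune_start` is hypothetical, and the sketch assumes the input is valid UTF-8 and that `pos` sits on a code point boundary.

```python
def prev_rune_start(data: bytes, pos: int) -> int:
    """Return the start index of the code point that ends just before `pos`.

    Assumes `data` is valid UTF-8 and `pos` is on a code point boundary.
    """
    if not 0 < pos <= len(data):
        raise ValueError("pos must be in the range 1..len(data)")
    pos -= 1
    # Skip continuation bytes, which all look like 0b10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
pos = len(data)
while pos > 0:
    start = prev_rune_start(data, pos)
    print(start, data[start:pos].decode("utf-8"))
    pos = start
```

Nothing comparable exists for graphemes: deciding where a cluster starts needs the Unicode segmentation rules and character properties, not just a bit mask.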

I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.

[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...

Avernar 3251 days ago

> But when you get right down to it, what I do for a living is write web applications,

That is my use case for Python as well.

> sometimes I have to write validation that cares about length,

That's where a truncation function that understands grapheme clusters would come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more so as not to split a grapheme cluster.
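A rough sketch of what such a function could look like in Python. The names `approx_clusters` and `truncate_bytes` are hypothetical, and the clustering is a deliberate approximation: it only glues combining marks onto the preceding code point and does not implement the full grapheme cluster rules from UAX #29 (no emoji ZWJ sequences, Hangul jamo, regional indicators, and so on).

```python
import unicodedata

def approx_clusters(text: str):
    """Yield a base code point together with any combining marks after it.

    Approximation only; real grapheme segmentation follows UAX #29.
    """
    cluster = ""
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

def truncate_bytes(text: str, max_bytes: int) -> str:
    """Keep whole clusters while the UTF-8 encoding fits in `max_bytes`."""
    kept = []
    used = 0
    for cluster in approx_clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break  # chop a bit more rather than split the cluster
        kept.append(cluster)
        used += size
    return "".join(kept)

text = "cafe\u0301!"               # 7 bytes in UTF-8
print(truncate_bytes(text, 5))     # 'caf', since 'e' + accent needs 3 more bytes
print(truncate_bytes(text, 6))     # 'café', with the accent kept
```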

Fortunately, my database does not have fixed-width strings, so I rarely bump into this one.

> or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time

I write my code to avoid this. Yes, I still have to use an index because that's what Python supports, but it would be trivial to convert it to another language that supports string iterators.
