by cmyr 1599 days ago
Something I haven't seen mentioned yet is one of the most annoying things about regional indicator symbols (RIS), which is that interpreting them correctly requires an unbounded amount of backtracking, and getting this right is a real pain for things like text fields.

Basically: a single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS counts as a single grapheme. Now imagine (assuming LTR text) that your cursor is to the right of an RIS and you press the left arrow. Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match the RIS up into pairs, starting from there, to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.
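
To make the backward scan concrete, here is a minimal sketch in Python. It is not taken from any particular text field implementation: the function names are made up for illustration, the cursor is an index in code points (Python `str` indices) rather than code units, and every non-RIS code point is treated as its own grapheme, which real grapheme segmentation does not do.

```python
RIS_FIRST = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
RIS_LAST = 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER Z


def is_ris(ch: str) -> bool:
    """True if ch is a regional indicator symbol."""
    return RIS_FIRST <= ord(ch) <= RIS_LAST


def left_arrow(text: str, cursor: int) -> int:
    """Return the new cursor index after one left-arrow press.

    Simplification (assumption): anything that is not an RIS is
    treated as a one-code-point grapheme.
    """
    if cursor <= 0:
        return 0
    if not is_ris(text[cursor - 1]):
        return cursor - 1

    # Scan backwards to the start of the run of RIS code points.
    # This is the part that can take arbitrarily long.
    run_start = cursor
    while run_start > 0 and is_ris(text[run_start - 1]):
        run_start -= 1

    # Pairs are formed from the start of the run. An even count before
    # the cursor means the cursor sits right after a complete pair, so
    # move by two; an odd count means the RIS just before the cursor is
    # unpaired or the first half of a pair, so move by one.
    run_len = cursor - run_start
    return cursor - (2 if run_len % 2 == 0 else 1)
```

For the string "a", two flags, then one unpaired RIS (six code points), repeated left-arrow presses from the end land on indices 5, 3, 1, 0. In UTF-16 each RIS is two code units, so the same moves would be 2 or 4 code units.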

This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.

edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?

1 comments

> Isn't hindsight a wonderful thing?

Thankfully, the creators of UTF-16 did have that foresight, so at least we don't have this problem at the code unit -> code point level.
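
For comparison, a rough sketch of why the code unit level is easy in UTF-16: high (leading) surrogates are 0xD800 to 0xDBFF and low (trailing) surrogates are 0xDC00 to 0xDFFF, so stepping back one code point never needs to look at more than two code units. The helper names here are hypothetical, and the input is assumed to be a plain sequence of 16-bit code unit values.

```python
from typing import List, Sequence


def is_high_surrogate(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def utf16_units(s: str) -> List[int]:
    """Encode s as a list of UTF-16 code unit values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[k:k + 2], "little")
            for k in range(0, len(data), 2)]


def prev_code_point_start(units: Sequence[int], i: int) -> int:
    """Index where the code point ending just before index i starts.

    Only the two preceding code units are inspected: a low surrogate
    preceded by a high surrogate is one code point, anything else
    (including a lone surrogate) is treated as a single unit.
    """
    if i <= 0:
        return 0
    if (i >= 2 and is_low_surrogate(units[i - 1])
            and is_high_surrogate(units[i - 2])):
        return i - 2
    return i - 1
```

Because the two halves come from disjoint ranges, there is no counting of pairs from the start of a run, unlike the RIS case.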