> > - First step (for a UTF-8-input terminal) is interpreting the input bytestream as UTF-8
> > - Second step is "parsing" the scalar values by running them through the DEC parser/state machine.
>
> Unfortunately, you may need to intermingle some logic between these two steps. While VT100-style control sequences are usually introduced with an ESC, they can also be represented as C1 control characters, e.g. 0x84 instead of ESC + D, 0x9b instead of ESC + [. These are raw bytes, not Unicode code points, and their encoding collides unpleasantly with UTF-8 continuation bytes. Further documentation: https://vt100.net/docs/vt220-rm/chapter4.html
>
> Since there's no standard that specifies how UTF-8 should interact with the terminal parser, you're a little bit on your own here. But probably the simplest fix is to introduce a special case into the UTF-8 decoder that allows stray continuation bytes to be passed through to the DEC parser, rather than transforming them to replacement characters immediately.
"UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before interpreting any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F."
The existing ANSI terminal emulators that support both UTF-8 input and C1 controls seem to agree on this (VTE, GNU screen, Mosh): they run the UTF-8 decoder first and only then interpret C1 controls. xterm, urxvt, tmux, PuTTY, and st don't seem to support C1 controls in UTF-8 mode. So I don't think poking holes in the UTF-8 decoder is necessary, especially since supporting C1 in UTF-8 mode is rare anyway.
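
For what it's worth, here is a minimal Python sketch of the decode-first ordering, not how any of the emulators above actually implement it. The class name `Utf8ThenC1` and its `feed` method are made up for illustration. It assumes the downstream DEC parser only needs to understand the 7-bit `ESC` forms, so each decoded C1 code point (U+0080 to U+009F) is rewritten as `ESC` followed by the byte 0x40 lower, which is the usual 7-bit equivalent (0x9B becomes `ESC [`, 0x84 becomes `ESC D`).

```python
import codecs

ESC = "\x1b"


class Utf8ThenC1:
    """Hypothetical front end: UTF-8 decode first, then handle C1 controls."""

    def __init__(self):
        # The incremental decoder buffers partial sequences across chunks and
        # turns invalid bytes (e.g. a stray 0x9B) into U+FFFD.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def feed(self, chunk: bytes) -> str:
        """Return scalar values ready for the escape-sequence parser."""
        out = []
        for ch in self._decoder.decode(chunk):
            cp = ord(ch)
            if 0x80 <= cp <= 0x9F:
                # C1 control as a decoded code point: emit its 7-bit form.
                out.append(ESC + chr(cp - 0x40))
            else:
                out.append(ch)
        return "".join(out)


if __name__ == "__main__":
    p = Utf8ThenC1()
    print(repr(p.feed(b"\xc2\x9b1m")))  # CSI encoded as UTF-8 -> '\x1b[1m'
    print(repr(p.feed(b"\x9b1m")))      # raw 0x9B byte -> '\ufffd1m'
```

With this ordering the UTF-8 decoder needs no special case: a CSI sent as the two bytes 0xC2 0x9B is recognized, while a raw 0x9B is just an invalid byte and becomes a replacement character.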