Depends on where you draw the line of what a parser is:

- If the parser is "the thing that comes after the lexer", then all of this is abstracted away by the lexer and you can just treat it as a span of bytes;
- If the parser is "everything that needs to be implemented to correctly transduce the input sequence into a tree", then you need to implement this yourself or have a lexer that handles it for you. This is usually done by having a tiny FSM in your lexer that recognizes UTF-8 code points (UTF-8 is a self-synchronizing code, which makes this part easier) and ignoring the existence of graphemes.

Most people, however, shy away from implementing a parser "all the way down to the bytes" and properly handling UTF-8 as a formal language. Most lean on a lexer to abstract this away. Ditto for context-sensitivity.

Recently Rust's regex engine underwent a major overhaul, and burntsushi wrote a blog post[0] about doing the "all the way to the bytes" thing in the new regex engine. I highly recommend the read:

[0] https://blog.burntsushi.net/regex-internals/#nfa-optimizatio...
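To make the "tiny UTF-8 codepoint recognizing FSM" concrete, here is a minimal sketch in Python. It is not taken from the linked post or from any particular lexer; the function name `decode_utf8` and the state variables are made up for illustration. The byte ranges follow the standard well-formedness rules for UTF-8 (no overlong encodings, no surrogates, nothing above U+10FFFF).

```python
def decode_utf8(data: bytes) -> list[int]:
    """Return the code points in `data`; raise ValueError on malformed UTF-8.

    Hypothetical helper for illustration. State is just `remaining`
    (continuation bytes still expected) plus the allowed range for the
    next continuation byte.
    """
    code_points = []
    remaining = 0          # continuation bytes still expected
    cp = 0                 # code point being assembled
    lo, hi = 0x80, 0xBF    # allowed range for the next continuation byte
    for i, b in enumerate(data):
        if remaining == 0:
            if b <= 0x7F:                 # ASCII: one byte, done
                code_points.append(b)
            elif 0xC2 <= b <= 0xDF:       # 2-byte lead (C0/C1 would be overlong)
                remaining, cp = 1, b & 0x1F
            elif 0xE0 <= b <= 0xEF:       # 3-byte lead
                remaining, cp = 2, b & 0x0F
                if b == 0xE0:
                    lo = 0xA0             # reject overlong encodings
                elif b == 0xED:
                    hi = 0x9F             # reject UTF-16 surrogates
            elif 0xF0 <= b <= 0xF4:       # 4-byte lead
                remaining, cp = 3, b & 0x07
                if b == 0xF0:
                    lo = 0x90             # reject overlong encodings
                elif b == 0xF4:
                    hi = 0x8F             # reject values above U+10FFFF
            else:
                raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")
        else:
            if not lo <= b <= hi:
                raise ValueError(f"invalid continuation byte 0x{b:02X} at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
            lo, hi = 0x80, 0xBF           # only the first continuation is restricted
            remaining -= 1
            if remaining == 0:
                code_points.append(cp)
    if remaining:
        raise ValueError("truncated sequence at end of input")
    return code_points
```

Note that it works purely on code points and knows nothing about graphemes. The self-synchronizing property shows up in the lead/continuation split: continuation bytes always look like `10xxxxxx`, so a lexer that hits an error can skip forward to the next byte that is not a continuation byte and resume from there.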
Most of the lexical/syntactic elements of languages are plain ASCII, not multi-byte UTF-8 sequences. You're looking for things like semicolons, quotes, and whitespace. If you keep the language's syntax/lexical elements within the ASCII subset of UTF-8, why does your lexer need to be aware of UTF-8 at all? It can just accumulate everything else as bytes, and it doesn't matter what format those bytes are in. The parser and/or codegen will do equality checks for lookups later on, but that doesn't need to be UTF-8 aware either, does it?
Am I missing something?
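A minimal sketch of the byte-oriented lexer described above, in Python. The toy grammar (the `DELIMS` set, double-quoted strings without escapes, the `lex` function and its token names) is hypothetical and only there to illustrate the idea. The lexer compares bytes against ASCII delimiters and accumulates everything else untouched.

```python
DELIMS = b";(){},"         # hypothetical single-byte punctuation
WHITESPACE = b" \t\r\n"    # ASCII whitespace only
QUOTE = 0x22               # the '"' byte

def lex(src: bytes) -> list[tuple[str, bytes]]:
    """Split `src` into (kind, raw_bytes) tokens without decoding anything."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        b = src[i]
        if b in WHITESPACE:
            i += 1
        elif b in DELIMS:
            tokens.append(("punct", src[i:i + 1]))
            i += 1
        elif b == QUOTE:
            # Toy string literal: runs to the next quote, no escape handling.
            end = src.find(b'"', i + 1)
            if end == -1:
                raise ValueError("unterminated string")
            tokens.append(("string", src[i + 1:end]))
            i = end + 1
        else:
            # Anything else is accumulated byte by byte until a delimiter.
            start = i
            while (i < n and src[i] not in WHITESPACE
                   and src[i] not in DELIMS and src[i] != QUOTE):
                i += 1
            tokens.append(("word", src[start:i]))
    return tokens
```

This is safe on UTF-8 input because every byte of a multi-byte sequence is 0x80 or higher, so it can never be mistaken for an ASCII delimiter. The sketch does not validate the bytes it accumulates, though, so malformed UTF-8 passes straight through to whatever consumes the tokens.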