|
|
|
|
|
by eatonphil
1063 days ago
|
|
Yeah I'm saying why does your lexer actually need to be UTF-8 aware? (An actual question, because maybe I'm not thinking of some obvious case.) Most of the lexical/syntactic elements of languages are not in UTF-8. You're looking for things like semicolons and quotes and whitespace. If you don't change the language syntax/lexical elements so that those parts stay as the ASCII subset of UTF-8 then why does your lexer need to be aware of UTF-8? It can just accumulate everything else as bytes and it doesn't matter what format the bytes are. The parser and/or codegen will do equality checks for lookups later on but that doesn't need to be UTF-8 aware either? Am I missing something? |
|