> The line number you could get without being UTF-8 aware.

Can you? Unicode has the following "new line" characters:

* U+000A Line Feed (LF) alone
* U+000D Carriage Return (CR) alone
* CRLF as one indivisible sequence
* U+000B Line Tabulation (VT): supporting this is explicitly optional, and the main standard's newline function definition does not include it
* U+000C Form Feed (FF)
* U+0085 Next Line (NEL), an EBCDIC round-trip compatibility character
* U+2028 Line Separator (LS)
* U+2029 Paragraph Separator (PS)

My source: https://langdev.stackexchange.com/a/590/717
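To make the difference concrete, here is a rough Python sketch comparing an LF-only line count with one that recognizes every newline listed above. The function names are made up for illustration, the input is assumed to be valid UTF-8, and the optional VT is counted here even though an implementation may leave it out.

```python
import re

# The newline functions listed above, spelled as UTF-8 byte sequences.
# CRLF comes first so it is matched as one break, not two.
# Assumption: the input is valid UTF-8, so C2 85 can only be U+0085,
# E2 80 A8 only U+2028, and E2 80 A9 only U+2029.
UNICODE_NEWLINE = re.compile(
    rb"\r\n|\n|\r|\x0b|\x0c|\xc2\x85|\xe2\x80\xa8|\xe2\x80\xa9"
)


def line_of_offset_lf_only(data: bytes, offset: int) -> int:
    """1-based line number of a byte offset, counting only LF (hypothetical helper)."""
    return data.count(b"\n", 0, offset) + 1


def line_of_offset_unicode(data: bytes, offset: int) -> int:
    """1-based line number of a byte offset, counting every newline listed above."""
    return sum(1 for _ in UNICODE_NEWLINE.finditer(data, 0, offset)) + 1


if __name__ == "__main__":
    data = "a\u2028b\r\nc\u0085d\ne".encode("utf-8")
    offset = data.index(b"e")
    print(line_of_offset_lf_only(data, offset))
    print(line_of_offset_unicode(data, offset))
```

The two counts agree on plain LF or CRLF text and diverge as soon as NEL, LS, PS, or a lone CR shows up. The second function still never decodes UTF-8, but it does have to know the multi-byte encodings of the non-ASCII newlines.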
Maybe I'm wrong though, just an assumption about what's common.
(See for example how Go, which is Unicode aware, defines tokens: https://go.dev/ref/spec#Tokens.)