Hacker News new | ask | show | jobs
by TRiG_Ireland 1068 days ago
> The line number you could get without being UTF-8 aware.

Can you? Unicode has the following "new line" characters:

* U+000A Line Feed (LF) alone

* U+000D Carriage Return (CR) alone

* CRLF as one indivisible sequence

* U+000B Line Tabulation (VT) — supporting this is explicitly optional, and the main standard's newline function definition does not include it

* U+000C Form Feed (FF)

* U+0085 Next Line (NEL), an EBCDIC round-trip compatibility character

* U+2028 Line Separator (LS)

* U+2029 Paragraph Separator (PS)

My source: https://langdev.stackexchange.com/a/590/717

1 comments

Yes whitespace in unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n which are expressible in ASCII.

Maybe I'm wrong though, just an assumption about what's common.

(See for example how Go, which is Unicode aware, defines tokens: https://go.dev/ref/spec#Tokens.)