

by HelloNurse 1896 days ago

Come on, the seventies have ended...

> Note that the above implies there is currently no support for features like:

> UTF-8 (or other unicode) input, input characters are all deemed to be in the range 0 to 255.


quincunx 1896 days ago

Thanks for studying this; your emphasis is noted. Aside from this particular bullet from the "notable limitations," are there any other shortcomings that you feel are deal breakers for you?


HelloNurse 1896 days ago

I don't like the YACC style of code fragments with placeholders, but the system seems well designed and it's likely to be good enough in practice.

But seriously, not being able to parse text is more than enough of a limitation for a text parsing tool. The token specification keeps close to standard regular expressions, and matching Unicode text with regular expressions (https://unicode.org/reports/tr18/) is a rather well researched problem with good implementations.
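To make the gap concrete, here is a minimal sketch of the decoding step that a byte-oriented scanner (input characters limited to 0 to 255) skips. The function name `utf8_decode` is hypothetical and is not part of the tool under discussion; it only illustrates that one Unicode character can span several input bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not part of the tool discussed here.
 * Decodes one UTF-8 sequence from s[0..len), stores the codepoint in *cp,
 * and returns the number of bytes consumed, or 0 on malformed or truncated
 * input (including overlong forms and surrogates). */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;          /* number of continuation bytes expected */
    uint32_t value, min;  /* accumulated codepoint, smallest legal value */

    if (len == 0)
        return 0;

    unsigned char b = s[0];
    if (b < 0x80) {                       /* single byte: plain ASCII */
        *cp = b;
        return 1;
    } else if ((b & 0xE0) == 0xC0) {      /* 110xxxxx: 2-byte sequence */
        need = 1; value = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {      /* 1110xxxx: 3-byte sequence */
        need = 2; value = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {      /* 11110xxx: 4-byte sequence */
        need = 3; value = b & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }

    if (len < need + 1)
        return 0;                         /* truncated sequence */

    for (size_t i = 1; i <= need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a 10xxxxxx continuation byte */
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong encodings, out-of-range values, and surrogates. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need + 1;
}
```

A byte-table scanner sees "é" as the two separate inputs 0xC3 and 0xA9, whereas a decoder like this yields the single codepoint U+00E9, which is what a Unicode-aware regular expression would need to match against.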


quincunx 1896 days ago

Thanks. It's not that UTF-8 is not on the list, it's always been on the list, it's just not there yet. Hence I felt the need to stipulate the lack of it in the manual, because of its importance.

If you're so inclined, examine rex.c, and you'll see (e.g. in rex_nfa_make_ranged_trans()) that the engine internally works with ranges of uint32 for this very Unicode reason.

The front-end regex parser and driver code, however, are not there yet, so prior to code emission, these beautiful ranges of uint32 codepoints are back-translated into rote uint8 tables. Such is the fate of wanting to ship. It'll come.
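As a rough illustration of that back-translation, here is a sketch that flattens codepoint ranges into a 256-entry table. It is not the code in rex.c: the `ranged_trans` type, the `NO_TRANSITION` marker, and `flatten_to_byte_table` are hypothetical names, and the real data structures may look quite different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not taken from rex.c. */
typedef struct {
    uint32_t lo;      /* first codepoint in the range (inclusive) */
    uint32_t hi;      /* last codepoint in the range (inclusive) */
    int      target;  /* state entered on any codepoint in [lo, hi] */
} ranged_trans;

#define NO_TRANSITION (-1)

/* Flatten uint32 codepoint ranges into a 256-entry byte-indexed table.
 * Any part of a range above 255 is dropped, which is where the Unicode
 * information is lost. Returns how many ranges were cut or dropped. */
size_t flatten_to_byte_table(const ranged_trans *ranges, size_t n,
                             int table[256])
{
    size_t truncated = 0;

    for (size_t i = 0; i < 256; ++i)
        table[i] = NO_TRANSITION;

    for (size_t r = 0; r < n; ++r) {
        if (ranges[r].lo > ranges[r].hi)
            continue;                     /* skip empty ranges */
        if (ranges[r].hi > 255)
            ++truncated;                  /* some codepoints will not fit */
        if (ranges[r].lo > 255)
            continue;                     /* entirely outside the byte range */

        uint32_t hi = ranges[r].hi > 255 ? 255 : ranges[r].hi;
        for (uint32_t c = ranges[r].lo; c <= hi; ++c)
            table[c] = ranges[r].target;
    }
    return truncated;
}
```

With ranges such as 'a' to 'z', U+00E0 to U+017F, and U+0400 to U+04FF, the first survives intact, the second is cut off at 255, and the third disappears from the table entirely. Emitting range-based transitions instead of byte tables would avoid that loss, at the cost of a decoding front end.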
