by HelloNurse 1896 days ago
I don't like the YACC style of code fragments with placeholders, but the system seems well designed and it's likely to be good enough in practice.

But seriously, not being able to parse text is more than enough of a limitation for a text parsing tool. The token specification keeps close to standard regular expressions, and matching Unicode text with regular expressions (https://unicode.org/reports/tr18/) is a rather well-researched problem with good implementations.

1 comment

Thanks. It's not that UTF-8 isn't on the list; it has always been on the list, it's just not there yet. Hence I felt the need to state its absence in the manual, precisely because UTF-8 matters so much.

If you're so inclined, examine rex.c, and you'll see (in rex_nfa_make_ranged_trans(), for example) that the engine internally works with ranges of uint32 for this very Unicode reason.

The front-end regex parser and driver code, however, are not there yet, so prior to code emission, these beautiful ranges of uint32 codepoints are back-translated into rote uint8 tables. Such is the fate of wanting to ship. It'll come.
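
To make the back-translation concrete, here is a rough sketch in C of what flattening codepoint ranges into a byte table can look like. This is not the code from rex.c: the ranged_trans struct, the flatten_to_byte_table() function, and the NO_TRANSITION marker are all hypothetical names made up for illustration, and the choice to truncate ranges at 0xFF is an assumption about one way to do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_TRANSITION (-1)  /* hypothetical marker for "no edge on this byte" */

/* Hypothetical ranged transition: any input in [first, last] leads to target. */
typedef struct {
    uint32_t first;   /* lowest codepoint in the range (inclusive) */
    uint32_t last;    /* highest codepoint in the range (inclusive) */
    int target;       /* destination state */
} ranged_trans;

/* Flatten the ranged transitions of one state into a 256-entry byte table.
 * Ranges reaching beyond 0xFF are truncated (or dropped if they start there).
 * Returns how many ranges were truncated or dropped, so a caller can tell
 * how much of the codepoint information did not survive. */
static size_t flatten_to_byte_table(const ranged_trans *trans, size_t count,
                                    int table[256])
{
    size_t clipped = 0;

    assert(table != NULL);
    assert(trans != NULL || count == 0);

    for (size_t i = 0; i < 256; i++)
        table[i] = NO_TRANSITION;

    for (size_t i = 0; i < count; i++) {
        uint32_t lo = trans[i].first;
        uint32_t hi = trans[i].last;

        if (hi > 0xFF) {        /* range extends past what a byte can hold */
            clipped++;
            hi = 0xFF;
        }
        if (lo > hi)            /* wholly outside the byte range, or empty */
            continue;

        for (uint32_t c = lo; c <= hi; c++)
            table[c] = trans[i].target;
    }
    return clipped;
}
```

With ranges such as 'a'..'z', U+03B1..U+03C9 and U+00E0..U+017F, the first survives intact, the second disappears entirely, and the third is cut off at 0xFF. That is the kind of loss a byte-table back end implies until the front end and driver can carry full codepoints through.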