by HelloNurse
1896 days ago
I don't like the YACC style of code fragments with placeholders, but the system seems well designed and it's likely to be good enough in practice. But seriously, not being able to parse text is more than enough of a limitation for a text parsing tool. The token specification keeps close to standard regular expressions, and matching Unicode text with regular expressions (https://unicode.org/reports/tr18/) is a rather well-researched problem with good implementations.
|
If you're so inclined, examine rex.c and you'll see (in rex_nfa_make_ranged_trans(), for example) that the engine internally works with ranges of uint32 for this very Unicode reason.
The front-end regex parser and driver code, however, are not there yet, so prior to code emission, these beautiful ranges of uint32 codepoints are back-translated into rote uint8 tables. Such is the fate of wanting to ship. It'll come.
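
For anyone wondering what that kind of back-translation could look like, here is a rough sketch. It is not the code from rex.c: the `ranged_trans` struct and `flatten_to_byte_table()` are made-up names for illustration, and the real data structures may differ. The assumption is simply that each transition covers an inclusive range of uint32 codepoints, and that anything above 0xFF has no place in a byte-indexed table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ranged transition (not the rex.c type):
 * on any codepoint in [first, last], move to state `target`. */
typedef struct {
    uint32_t first;
    uint32_t last;
    int target;
} ranged_trans;

/*
 * Flatten ranged transitions into a 256-entry table indexed by byte value.
 * Entries default to -1 (no transition). Any part of a range above 0xFF
 * cannot be represented in the table and is dropped.
 * Returns the number of ranges that were truncated or dropped entirely.
 */
static int flatten_to_byte_table(const ranged_trans *trans, size_t n,
                                 int table[256])
{
    int lossy = 0;

    for (int i = 0; i < 256; i++)
        table[i] = -1;

    for (size_t i = 0; i < n; i++) {
        assert(trans[i].first <= trans[i].last);

        if (trans[i].last > 0xFF)
            lossy++;                 /* part or all of the range is lost */
        if (trans[i].first > 0xFF)
            continue;                /* entirely outside the byte range */

        uint32_t hi = trans[i].last > 0xFF ? 0xFF : trans[i].last;
        for (uint32_t c = trans[i].first; c <= hi; c++)
            table[c] = trans[i].target;
    }
    return lossy;
}
```

In this sketch, a range like 'a' to 'z' survives intact, while a range such as U+03B1 to U+03C9 (Greek lowercase) disappears from the table altogether, which is the limitation being discussed.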