|
|
|
|
|
by glangdale
2643 days ago
|
|
Unicode is a PITA. In Hyperscan, it's not pretty what gets generated for a bare \w in UCP mode if you force it into an NFA (it's rather more tractable as a DFA, even if you aren't lazily generating, although of course betting the farm that you can always 'busily' generate a DFA isn't great). I've always thought that a better job of doing NFAs (Gluskov or otherwise) and staying bit-parallel would be done with having character reachability on codepoints, not bytes, generally remapping down to 'which codepoints make an actual difference'. This sounds ugly/terrifying, but the nice thing is that remapping a long stream of codepoints could be done in parallel (as it's not hard to find boundaries) and with SIMD. Step by step NFA or DFA work is more ponderous as every state depends on previous states. |
|
And of course, one doesn't need to bet the farm on a lazy DFA if you have one, although it is quite robust in a large number of practical scenarios. (I think RE2 does bet the farm, to be fair.)