|
|
|
|
|
by burntsushi
1204 days ago
|
|
Oh nice! Unicode is definitely something that's on my mind when thinking about derivatives and how to deal with them, but it sounds like ocaml-re is doing pretty well outside of Unicode. I would love to hook it up to my benchmark harness. (It isn't public yet... Hopefully soon. But it supports regexes in any languages. So far I have Rust, C, C++, Python and Go. I hope to add .NET, Perl and Node at least. But this might be a cool addition too.) If anyone wants to add this Ocaml engine to the harness (or any other engine), please email me at jamslam@gmail.com and I'll give access to the repo. The only reason it isn't public yet is because I'm still working on the initial release and iterating. But it's close enough where other people could submit benchmark programs for other regex engines. |
|
It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivative re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives.
Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters. (Or a boolean algebra if you care about negation/complementation).
The difficulty in handling unicode has more to do with the efficiency of the automaton representation and manipulation rather than in turning the RE in an NFA or DFA.