|
|
|
|
|
by def-lkb
1194 days ago
|
|
> Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-) I don't know any besides ocaml-re that Drup already linked, sorry :). And sorry that my comment is hard to decipher. I think the core point is that the "character set" can be an abstract type from the point of view of the derivation algorithm.
So it doesn't matter how they are represented, nor "how big" a character set is. With Antimirov's derivative (which produces an NFA), there is no constraint on this type. With Brzozowski's derivative, you need at least the ability to intersect two character sets. So the type should implement a trait with an intersection function (in Rust syntax, `trait Intersect fn intersect(self, Self) -> Self`). That's necessary for any implementation generating a DFA anyway. And if you also want to deal with complementation, then a second method `fn negate(self) -> Self` is necessary. |
|
I do wonder if there is some room for derivatives in a meta regex engine (like RE2 or the regex crate). For example, if it let you build a DFA more quickly (in practice, not necessarily in theory), then you might be able to use it for a big subset of cases. It's tricky to make that case over the lazy DFA, however, a full DFA has more optimization opportunities. For example, identifying states with very few outgoing transitions and "accelerating" them by running memchr (or memchr2 or memchr3) on those outgoing transitions instead of continuing to walk the automaton. It's really hard to do that with a lazy DFA because you don't really compute entire states up front.