| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by def-lkb 1206 days ago

I don't think you should be worried about Unicode in particular. Although the derivation formula on paper is parameterized by a character, you don't have to compute the derivative for every character separately.

It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivative re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives.

Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters. (Or a boolean algebra if you care about negation/complementation).

The difficulty in handling unicode has more to do with the efficiency of the automaton representation and manipulation rather than in turning the RE in an NFA or DFA.

1 comments

burntsushi 1206 days ago

Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

It has been a long time since I read the "Regular-expression derivative re-examined" derivative paper. Mostly the only thing I remember at this point is that I came away thinking that it would be difficult to adapt in practice for large Unicode classes. But I don't remember the details.

It is honestly very difficult for me to translate your comment here into an actionable implementation strategy. But that's probably just my inexperience with derivatives talking.

link

def-lkb 1206 days ago

> Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

I don't know any besides ocaml-re that Drup already linked, sorry :).

And sorry that my comment is hard to decipher. I think the core point is that the "character set" can be an abstract type from the point of view of the derivation algorithm. So it doesn't matter how they are represented, nor "how big" a character set is.

With Antimirov's derivative (which produces an NFA), there is no constraint on this type.

With Brzozowski's derivative, you need at least the ability to intersect two character sets. So the type should implement a trait with an intersection function (in Rust syntax, `trait Intersect fn intersect(self, Self) -> Self`). That's necessary for any implementation generating a DFA anyway.

And if you also want to deal with complementation, then a second method `fn negate(self) -> Self` is necessary.

link

burntsushi 1205 days ago

Thanks! You might be right. I'm probably at a point where I'd have to actually go out and try it to understand it better.

I do wonder if there is some room for derivatives in a meta regex engine (like RE2 or the regex crate). For example, if it let you build a DFA more quickly (in practice, not necessarily in theory), then you might be able to use it for a big subset of cases. It's tricky to make that case over the lazy DFA, however, a full DFA has more optimization opportunities. For example, identifying states with very few outgoing transitions and "accelerating" them by running memchr (or memchr2 or memchr3) on those outgoing transitions instead of continuing to walk the automaton. It's really hard to do that with a lazy DFA because you don't really compute entire states up front.

link

def-lkb 1205 days ago

I think what you suggest is possible, derivation might even be well suited for this application, however I can't tell if it would be better than existing approaches. There are some chances that it might be interesting in practice, since it seems that this application of derivatives has not been much studied, but that's highly speculative.

link