

	by Drup 1203 days ago
	ocaml-re[1] uses a derivative-style construction to lazily build a DFA. The general idea is to use something similar to Owens et al.'s DFA construction, but done inline with some caching so that compilation is lazy (and building a Thompson-like automaton for group capture). In practice, it is fairly fast, although not as optimized as the Rust crate. :) Derivatives support several match semantics very easily (ocaml-re does longest, shortest, greedy, non-greedy, first). It indeed doesn't handle Unicode matching though (it's possible, and there is a prototype implementation, but nobody took the time to push it through). Note that it's not difficult to build an NFA (lazily or not) using derivatives as well, with Antimirov's construction. [1]: https://github.com/ocaml/ocaml-re/
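
As a rough illustration of the lazy construction described above, here is a minimal Rust sketch of Brzozowski derivatives with cached transitions. It is not ocaml-re's code: the `Re` type, the smart constructors `cat`/`alt`/`star`, and `LazyDfa` are hypothetical names. It works on bytes only, does no group capture, and its simplification rules are too light to guarantee a finite state set for every regex.

```rust
use std::collections::HashMap;

/// Tiny regex AST over bytes (hypothetical, for illustration only).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Re {
    Empty,                 // matches nothing
    Eps,                   // matches only the empty string
    Byte(u8),              // one literal byte
    Cat(Box<Re>, Box<Re>), // concatenation
    Alt(Box<Re>, Box<Re>), // alternation
    Star(Box<Re>),         // Kleene star
}

// Smart constructors: light simplification so equal states are recognized more often.
fn cat(a: Re, b: Re) -> Re {
    match (a, b) {
        (Re::Empty, _) | (_, Re::Empty) => Re::Empty,
        (Re::Eps, r) | (r, Re::Eps) => r,
        (a, b) => Re::Cat(Box::new(a), Box::new(b)),
    }
}

fn alt(a: Re, b: Re) -> Re {
    match (a, b) {
        (Re::Empty, r) | (r, Re::Empty) => r,
        (a, b) if a == b => a,
        (a, b) => Re::Alt(Box::new(a), Box::new(b)),
    }
}

fn star(r: Re) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Eps,
        Re::Star(inner) => Re::Star(inner),
        r => Re::Star(Box::new(r)),
    }
}

/// Does `r` accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Byte(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Cat(a, b) => nullable(a) && nullable(b),
        Re::Alt(a, b) => nullable(a) || nullable(b),
    }
}

/// Brzozowski derivative of `r` with respect to the byte `c`.
fn deriv(r: &Re, c: u8) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Byte(b) => {
            if *b == c { Re::Eps } else { Re::Empty }
        }
        Re::Cat(a, b) => {
            let left = cat(deriv(a, c), (**b).clone());
            if nullable(a) { alt(left, deriv(b, c)) } else { left }
        }
        Re::Alt(a, b) => alt(deriv(a, c), deriv(b, c)),
        Re::Star(a) => cat(deriv(a, c), Re::Star(a.clone())),
    }
}

/// DFA whose states are regexes and whose transitions are computed on demand.
struct LazyDfa {
    states: Vec<Re>,                    // state id -> regex it stands for
    ids: HashMap<Re, usize>,            // regex -> state id
    trans: HashMap<(usize, u8), usize>, // cache of transitions taken so far
}

impl LazyDfa {
    fn new(re: Re) -> Self {
        let mut ids = HashMap::new();
        ids.insert(re.clone(), 0);
        LazyDfa { states: vec![re], ids, trans: HashMap::new() }
    }

    fn step(&mut self, s: usize, c: u8) -> usize {
        if let Some(&t) = self.trans.get(&(s, c)) {
            return t; // cache hit: no derivative computed
        }
        let d = deriv(&self.states[s], c);
        let t = match self.ids.get(&d).copied() {
            Some(t) => t,
            None => {
                let t = self.states.len();
                self.states.push(d.clone());
                self.ids.insert(d, t);
                t
            }
        };
        self.trans.insert((s, c), t);
        t
    }

    /// Whole-input match: the final state must be nullable.
    fn is_match(&mut self, input: &[u8]) -> bool {
        let mut s = 0;
        for &c in input {
            s = self.step(s, c);
        }
        nullable(&self.states[s])
    }
}
```

Only the states reached by the inputs seen so far are ever built, which is the point of doing the construction inline with a cache.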


burntsushi 1203 days ago

Oh nice! Unicode is definitely something that's on my mind when thinking about derivatives and how to deal with them, but it sounds like ocaml-re is doing pretty well outside of Unicode. I would love to hook it up to my benchmark harness. (It isn't public yet... Hopefully soon. But it supports regexes in any language. So far I have Rust, C, C++, Python and Go. I hope to add .NET, Perl and Node at least. But this might be a cool addition too.)

If anyone wants to add this OCaml engine to the harness (or any other engine), please email me at jamslam@gmail.com and I'll give access to the repo. The only reason it isn't public yet is because I'm still working on the initial release and iterating. But it's close enough that other people could submit benchmark programs for other regex engines.


def-lkb 1203 days ago

I don't think you should be worried about Unicode in particular. Although the derivation formula on paper is parameterized by a character, you don't have to compute the derivative for every character separately.

It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivatives re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives.

Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters (or on a Boolean algebra if you care about negation/complementation).

The difficulty in handling Unicode has more to do with the efficiency of the automaton representation and manipulation than with turning the RE into an NFA or DFA.
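
To make the "classes of characters with the same derivative" idea concrete, here is a small Rust sketch. It is a simplified stand-in, not the construction from the paper: it cuts the code point space at every boundary of every character set mentioned in the regex, so it may split more finely than strictly needed. The names `Range`, `MAX_CODE_POINT`, and `classes` are hypothetical, and each input range is assumed to satisfy `lo <= hi <= MAX_CODE_POINT`.

```rust
/// Inclusive range of Unicode code points (hypothetical representation).
type Range = (u32, u32);

const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Partition 0..=MAX_CODE_POINT into intervals such that all code points in
/// one interval belong to exactly the same input sets. Taking one derivative
/// per interval (using any member as representative) is then enough.
fn classes(sets: &[Vec<Range>]) -> Vec<Range> {
    // Every point where membership in some set can change starts a new class.
    let mut cuts = vec![0u32];
    for set in sets {
        for &(lo, hi) in set {
            cuts.push(lo);
            if hi < MAX_CODE_POINT {
                cuts.push(hi + 1);
            }
        }
    }
    cuts.sort_unstable();
    cuts.dedup();

    // Turn consecutive cut points into inclusive intervals.
    let mut out = Vec::with_capacity(cuts.len());
    for (i, &lo) in cuts.iter().enumerate() {
        let hi = if i + 1 < cuts.len() { cuts[i + 1] - 1 } else { MAX_CODE_POINT };
        out.push((lo, hi));
    }
    out
}
```

The work here depends on the number of ranges written in the regex, not on the size of the alphabet: for `[a-z]` and `[0-9a-f]` it yields six classes covering all of Unicode.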


burntsushi 1203 days ago

Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

It has been a long time since I read the "Regular-expression derivatives re-examined" paper. Mostly the only thing I remember at this point is that I came away thinking that it would be difficult to adapt in practice for large Unicode classes. But I don't remember the details.

It is honestly very difficult for me to translate your comment here into an actionable implementation strategy. But that's probably just my inexperience with derivatives talking.


def-lkb 1203 days ago

> Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

I don't know any besides ocaml-re that Drup already linked, sorry :).

And sorry that my comment is hard to decipher. I think the core point is that the "character set" can be an abstract type from the point of view of the derivation algorithm. So it doesn't matter how character sets are represented, nor "how big" a character set is.

With Antimirov's derivative (which produces an NFA), there is no constraint on this type.

With Brzozowski's derivative, you need at least the ability to intersect two character sets. So the type should implement a trait with an intersection function (in Rust syntax, `trait Intersect { fn intersect(self, other: Self) -> Self; }`). That's necessary for any implementation generating a DFA anyway.

And if you also want to deal with complementation, then a second method `fn negate(self) -> Self` is necessary.
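
As a sketch of what such an abstract character-set type could look like, here are the two traits with one possible implementation. The `CharSet` type (a sorted list of non-overlapping inclusive code point ranges) and the `Negate` trait name are assumptions made for this example. Any other representation would do as long as it provides the same two operations.

```rust
/// The operation Brzozowski-style DFA construction needs from a character set.
trait Intersect: Sized {
    fn intersect(self, other: Self) -> Self;
}

/// Only needed if the regex language has negation/complementation.
trait Negate: Sized {
    fn negate(self) -> Self;
}

/// Hypothetical representation: sorted, non-overlapping inclusive ranges.
#[derive(Clone, Debug, PartialEq, Eq)]
struct CharSet(Vec<(u32, u32)>);

impl CharSet {
    const MAX: u32 = 0x10FFFF; // largest Unicode code point
}

impl Intersect for CharSet {
    fn intersect(self, other: Self) -> Self {
        let (a, b) = (self.0, other.0);
        let (mut i, mut j) = (0, 0);
        let mut out = Vec::new();
        // Merge-style walk over both sorted range lists.
        while i < a.len() && j < b.len() {
            let lo = a[i].0.max(b[j].0);
            let hi = a[i].1.min(b[j].1);
            if lo <= hi {
                out.push((lo, hi));
            }
            // Advance whichever range ends first.
            if a[i].1 < b[j].1 {
                i += 1;
            } else {
                j += 1;
            }
        }
        CharSet(out)
    }
}

impl Negate for CharSet {
    fn negate(self) -> Self {
        let mut out = Vec::new();
        let mut next = 0u32; // smallest code point not yet accounted for
        let mut reached_max = false;
        for (lo, hi) in self.0 {
            if lo > next {
                out.push((next, lo - 1)); // gap before this range
            }
            if hi == CharSet::MAX {
                reached_max = true;
                break;
            }
            next = hi + 1;
        }
        if !reached_max {
            out.push((next, CharSet::MAX)); // everything after the last range
        }
        CharSet(out)
    }
}
```

Both operations cost time proportional to the number of ranges, so a class covering a large part of Unicode is no more expensive than `[a-z]` as long as it is described by few ranges.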


burntsushi 1203 days ago

Thanks! You might be right. I'm probably at a point where I'd have to actually go out and try it to understand it better.

I do wonder if there is some room for derivatives in a meta regex engine (like RE2 or the regex crate). For example, if it let you build a DFA more quickly (in practice, not necessarily in theory), then you might be able to use it for a big subset of cases. It's tricky to make that case over the lazy DFA. However, a full DFA has more optimization opportunities: for example, identifying states with very few outgoing transitions and "accelerating" them by running memchr (or memchr2 or memchr3) on those outgoing transitions instead of continuing to walk the automaton. It's really hard to do that with a lazy DFA because you don't really compute entire states up front.
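
The acceleration idea can be sketched roughly as follows. This is an illustration under assumptions, not how RE2 or the regex crate implement it: the `Dfa` layout and the `accel` field are hypothetical, and a plain `position` scan from the standard library stands in for memchr/memchr2/memchr3 so that the example has no dependencies.

```rust
/// A fully built DFA over bytes (hypothetical layout, for illustration only).
struct Dfa {
    trans: Vec<[usize; 256]>, // trans[state][byte] = next state
    accept: Vec<bool>,        // accept[state] = is this a match state?
    // For each state: the bytes that leave it, if there are at most 3 of them.
    accel: Vec<Option<Vec<u8>>>,
}

impl Dfa {
    fn new(trans: Vec<[usize; 256]>, accept: Vec<bool>) -> Dfa {
        // Computable only because every state's full transition row is known up front.
        let accel: Vec<Option<Vec<u8>>> = trans
            .iter()
            .enumerate()
            .map(|(s, row)| {
                let exits: Vec<u8> =
                    (0..=255u8).filter(|&b| row[b as usize] != s).collect();
                if exits.len() <= 3 { Some(exits) } else { None }
            })
            .collect();
        Dfa { trans, accept, accel }
    }

    fn is_match(&self, input: &[u8]) -> bool {
        let (mut s, mut i) = (0, 0);
        while i < input.len() {
            if let Some(exits) = &self.accel[s] {
                // Skip the self-looping bytes in one scan.
                // Stand-in for memchr / memchr2 / memchr3.
                match input[i..].iter().position(|b| exits.contains(b)) {
                    Some(off) => i += off,
                    None => break, // no exit byte remains: we end in state `s`
                }
            }
            s = self.trans[s][input[i] as usize];
            i += 1;
        }
        self.accept[s]
    }
}
```

With a three-state DFA for "contains `ab`", the start state has `a` as its only exit byte, so the scan jumps straight to the next `a`. The accepting sink state has no exits at all, so the search stops as soon as it is reached.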


def-lkb 1203 days ago

I think what you suggest is possible, and derivation might even be well suited for this application. However, I can't tell if it would be better than existing approaches. There is some chance that it might be interesting in practice, since this application of derivatives does not seem to have been studied much, but that's highly speculative.


Drup 1203 days ago

Having good-quality, curated regex benchmarks would be quite useful! I hope you plan to cover several features and to allow engines that only have partial support. That would make for very interesting comparisons.


burntsushi 1203 days ago

It does. And more. The only thing you have to do is provide a short program that parses the description of the benchmark on stdin and then outputs a list of samples, each consisting of the time it took to run a single iteration and the "result" of the benchmark for verification. The harness takes over from there. There's no need to have any Unicode support at all. I even have a program for benchmarking `memmem`, which is of course not a regex engine at all.
