| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by profquail 2006 days ago
	I wrote an implementation of this several years back. If you’re interested in the code: https://github.com/jack-pappas/facio/tree/master/Reggie The derivatives approach makes Unicode support easier since its able to keep the symbols sets for each transition edge (in the DFA) more compact by virtue of supporting negation. If you add in aggressive term-normalization, hash-consing, and an efficient dense-set implementation (all of which I’ve done in my implementation), the derivatives approach can be extremely fast, even when generating the DFA for something like the lexer of a full programming language (in my case, F#).

1 comments

chubot 2006 days ago

Very cool! And thanks for the reminder about Unicode. I think supporting union and intersection is also somewhat unique to the derivatives method, and also related? (Although I think there are really 2 derivatives methods: Brzozowski and Antimirov)

What happens if you don't do the optimizations? Does the DFA blow up in size, meaning the compile time is large? Or does it make for a slower runtime? I would expect most DFAs to run at about the same speed, unless they are really huge...

I'd be interested in any rough ideas about performance, e.g. how fast a realistic lexer+parser is, maybe in lines/ms.

It does look like the code is pretty short -- a large part of it is an AVL tree library I guess for hash consing?

I'm interested in any downsides of the derivatives technique vs. the NFA->DFA method. I feel like regex compile time shouldn't matter for many applications, and most DFAs will run in the same speed, which only leaves runtime memory usage (or code size for generating F# code like you appear to be doing).

link

profquail 2006 days ago

The optimizations get you the following:

* Normalization: this is where "smart constructors" come in handy; having a normal form for the terms allows the caching to work better. This also impacts the compactness of the generated DFA. * Hash-consing: this turns structural equality (in this case) to a simple pointer equality; applied recursively, this makes it much faster to compare two terms for equality, and overall speeds up the DFA generation by a non-trivial amount (I forget the exact numbers, but it was significant). * Dense set implementation: The AVL tree-based data structure in the facio/Reggie code is an implementation of the Discrete Interval Encoding Tree (DIET) data structure from "Diets for fat sets" and "More on Balanced Diets" papers.

Note the optimizations I've mentioned here impact the performance of generating the DFA. Once you have the DFA, it'll run at the same speed as one generated in any other way. Part of the motiviation for my writing this library was to learn about regex/DFAs/grammars, but also to try to improve on the performance of fslex/fsyacc at the time. Using this library, the FSharpLex tool can generate the DFA for the full F# language grammar in well under 1 sec; the code generation takes a bit longer, largely due to having to convert the DFA into a different form for backwards-compatibility with fslex.

Overall, I feel like the derivatives technique is generally better and simpler, and I'm not aware of any real downsides. The only one that comes to mind is if you're wanting to implement things like backreferences and capture groups -- those obviously make the implementation (of the DFA) more complicated, and there's a lot less literature on it (last I saw, maybe only one or two papers on implementing those features on top of a derivatives-based regex engine).

link

chubot 2006 days ago

Great info, thanks! I think derivatives could be very suited for a compact implementation of POSIX regular expressions. You need to handle unicode classes but not backreferences!

Although it does seem more suited for functional languages for sure, whereas I basically only have a C runtime.

Capturing might be an issue. I found this

https://www.home.hs-karlsruhe.de/~suma0002/publications/posi...

but I think it's actually being a bit pedantic, i.e. if "almost all POSIX implementations are buggy" then applications don't rely on that exact semantic (they probably rely on the buggy one, if anything ...)

Maybe more relevant: http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp1...

link