|
|
|
|
|
by someplaceguy
100 days ago
|
|
> the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression The authors seem to claim linear complexity: > the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks |
|
The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.
It is true that we cannot avoid enormous DFA sizes, a simple case would be (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)... which has 2^4 states and every intersection adds +1 to the exponent.
How we get around this in the real world is that we create at most one state per input character, so even if the full DFA size is 1 million, you need an input that is at least 1 million characters long to reach it.
The real argument to complexity is how expensive can the cost of taking a lazy derivative get? The first time you use the engine with a unique input and states, it is not linear - the worst case is creating a new state for each character. The second time the same (or similar) input is used these states are already created and it is linear. So as said in the article it is a bit foggy - Lazy DFAs are not linear but appear as such for practical cases