|
|
|
|
|
by slimsag
1926 days ago
|
|
I imagine this works for regex engines like RE2 that don't support backtracking, but would not be possible if supporting more advanced regex features like backtracking? I definitely have more to learn about automata and FSMs :) |
|
In practice, production grade general purpose regex engines are never implemented with formal DFAs in all cases. They take too long to build and use way too much memory, especially when Unicode is involved. For example:
And this is including oodles of tricks for shrink the DFA's size. (Note though that the above examples minimize the DFA, which adds significant time. Without minimization, the latter is about an order of magnitude faster.)So even this simple regex is 3MB in size.
Finally, DFAs can grow exponentially in the size of the regex, which means you can't really use them with untrusted regexes.
In practice, regex engines like RE2 use a hybrid NFA/DFA, or "lazy DFA." The DFA is computed at search time and cached. If too much cache is used, then the regex engine falls back to an NFA simulation.
Beating the performance of a DFA without going into more exotic things like vectorized algorithms or more specialized regex engines (bit-parallel NFAs) is pretty tricky.
And also, do not repeat the same mistake that the OP seems to make: conflating what DFA and NFA mean. A lot of people like to call things like PCRE an "NFA" implementation, but it's not, because meaningfully more powerful than what an NFA can do. General purpose regex engines tend to be split between ones that are based on finite automata and ones that are based on backtracking.