|
|
|
|
|
by glangdale
2690 days ago
|
|
The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this. This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously. Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies. For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc. |
|