Hacker News new | ask | show | jobs
by nostrademons 2691 days ago
Isn't this basically what regexp-based lexer generators (eg. lex, 1975) do? Stdlib regexps need to run multiple passes because the library itself has no idea whether it's being used with one pattern or multiple patterns, but any lexer generator that has all the patterns at once (i.e. all of them) can merge the DFAs generated and match the input in a single pass.

Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.

1 comments

The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this.

This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously.

Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies.

For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.