|
|
|
|
|
by osaariki
1200 days ago
|
|
We built the new engine behind .NET's RegexOptions.NonBacktracking with derivatives. We will have a paper at PLDI this year on the work that went into that. PCRE semantics was indeed the big thing that required new techniques. Basically, you have to modify the derivation function such that the derivatives model what paths through the pattern a backtracking engine would consider before stopping at the current position. The big thing derivatives buys you is the ability to apply rewrite rules lazily during the construction. For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it. These kinds of rewrites often result in the DFA that gets constructed being minimal or close to so. Of course, you do pay somewhat for the machinery to do this, so best-case construction times suffer compared to in traditional NFA+lazy-DFA engines like RE2 or Rust's, but a larger class of patterns get to stay in the DFA world with derivatives. I hope our work ignites a new interest in regex matching with derivatives. I believe the ability to apply these syntactic rewrites on-the-fly is really powerful and I'd love to see how far folks like you with extensive experience in optimizing regex matchers can take this. |
|
> For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it.
This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.
> but a larger class of patterns get to stay in the DFA world with derivatives.
Can you say more about this?