Hacker News new | ask | show | jobs
by eyelidlessness 761 days ago
FWIW, it seems pretty likely the same multi-NFA algorithm described in the article could be applied to other backtracking scenarios (and therefore to back references and lookaround). Since I’m just speculating, I’ll hazard a guess that it would result in a meaningfully more complex implementation to derive each state machine, and potentially much slower (while still “linear” complexity).
1 comments

As the OP specifically and explicitly says, back-references are definitively NP-complete. The OP even links to a proof[1]. If you found a way to implement them in worst case linear time, then I believe you would have discovered a constructive proof for p=np.

Look-around is a different story, but I don't believe you can still guarantee linear time.

[1]: https://perl.plover.com/NPC/NPC-3SAT.html

I also wanted to post that link, but now that I’m rereading it I’m starting to doubt that it proves what we want it to prove here. The way it’s presented, the size of the 3-SAT formula corresponds to the size of the expression, not the size of the haystack; and that compiling a regex is exponential in its length is not exactly a surprise—the standard determinization procedure used to compile actually regular (no backreferences) regexes for linear matching is also exponential. So what we’d actually want to prove is along the lines of NP-completeness even if we allow for an exponential amount of preprocessing on the expression only.

(I’m willing to believe backreference matching is NP-complete, I just think the linked statement is weaker than that.)

Ah I see. Geoff Langdale (creator of Hyperscan) has a post about this. The comments seem relevant: https://branchfree.org/2019/04/04/question-is-matching-fixed...
Sat solvers have been implemented on backtracking regex engines. Sat is NP thus game over for looking for linear time solutions to arbitrary backtracking regex. Sadly.
> Sat solvers have been implemented on backtracking regex engines.

Again, if the (not-really-)regular expression grows as the SAT problem does, that doesn’t mean game over yet. Your favourite ahead-of-time regex compiler (the good interactive ones are usually incremental, but lex/flex or re2c certainly fit) is also exponential in the size of the regex, because determinization can have exponential-size output. After the (exponentially large) compiled form is produced, though, matching is linear in the size of the haystack no matter what it is.

Burntsushi’s sibling reply links to blog post and thence to a proof-of-concept regular+backreferences expression matcher[1] that works in (quite miserable but still) polynomial time wrt the haystack, with the degree of the polynomial depending on the number of groups that have backreferences to them.

That’s not exactly fantastic, mind you—recall that you can match arbitrary context-free languages in “merely” cubic time. But it’s not exponential.

[1] https://github.com/travisdowns/polyregex

Attempting polynomial with respect to the string being searched for a constant regex is interesting. A finite regex will contain a finite number of back references.

I think a finite number of back references implies a maximum stack use for the backtracking approach. DFA plus finite size stack machine is convertible to a DFA with a smaller stack and onwards to a pure DFA.

That is probably worth trying to implement, thank you for the push.

Look-arounds can be implemented in quadratic time for unbounded expressions (i.e: containing +, *), and linear time for bounded expressions quite easily. And I suspect they can be implemented in (super)linear time in general by matching them in parallel to the NFA.