Hacker News new | ask | show | jobs
by mananaysiempre 761 days ago
I also wanted to post that link, but now that I’m rereading it I’m starting to doubt that it proves what we want it to prove here. The way it’s presented, the size of the 3-SAT formula corresponds to the size of the expression, not the size of the haystack; and that compiling a regex is exponential in its length is not exactly a surprise—the standard determinization procedure used to compile actually regular (no backreferences) regexes for linear matching is also exponential. So what we’d actually want to prove is along the lines of NP-completeness even if we allow for an exponential amount of preprocessing on the expression only.

(I’m willing to believe backreference matching is NP-complete, I just think the linked statement is weaker than that.)

2 comments

Ah I see. Geoff Langdale (creator of Hyperscan) has a post about this. The comments seem relevant: https://branchfree.org/2019/04/04/question-is-matching-fixed...
Sat solvers have been implemented on backtracking regex engines. Sat is NP thus game over for looking for linear time solutions to arbitrary backtracking regex. Sadly.
> Sat solvers have been implemented on backtracking regex engines.

Again, if the (not-really-)regular expression grows as the SAT problem does, that doesn’t mean game over yet. Your favourite ahead-of-time regex compiler (the good interactive ones are usually incremental, but lex/flex or re2c certainly fit) is also exponential in the size of the regex, because determinization can have exponential-size output. After the (exponentially large) compiled form is produced, though, matching is linear in the size of the haystack no matter what it is.

Burntsushi’s sibling reply links to blog post and thence to a proof-of-concept regular+backreferences expression matcher[1] that works in (quite miserable but still) polynomial time wrt the haystack, with the degree of the polynomial depending on the number of groups that have backreferences to them.

That’s not exactly fantastic, mind you—recall that you can match arbitrary context-free languages in “merely” cubic time. But it’s not exponential.

[1] https://github.com/travisdowns/polyregex

Attempting polynomial with respect to the string being searched for a constant regex is interesting. A finite regex will contain a finite number of back references.

I think a finite number of back references implies a maximum stack use for the backtracking approach. DFA plus finite size stack machine is convertible to a DFA with a smaller stack and onwards to a pure DFA.

That is probably worth trying to implement, thank you for the push.