Hacker News new | ask | show | jobs
by iron2disulfide 1320 days ago
> Most don't

Is this true? I'm pretty sure that most of the regex engines I've used (grep, ripgrep, re2, hyperscan) use Thompson's construction or at least some NFA-based algorithm (not necessarily Thomopson's; Glushkov automata in particular are used by hyperscan).

1 comments

Yes, all of those are finite automata based. Although grep usually uses the POSIX system regex library. GNU grep will use its own finite automata based approach for a subset (a large subset) of regexes.

But most regex engines are backtracking based. Perl, PCRE (used by PHP and probably others), Oniguruma/Onigmo (used by Ruby), Javascript, Java, Python. That covers a lot of ground.

Plus, popular reference web sites like https://www.regular-expressions.info/ are heavily biased toward regex engines that have features typically only implemented by backtracking engines.

There's also postgres's which descends from spencer's (Tcl Advanced Regular Expressions) and supports backreferences but IME is very hard to catch into catastrophic backtracking.

Possibly because you need to use the feature to trigger the NFA mode and the baseline test case for catastrophic backtracking doesn't, it just assumes the engine will backtrack.

GNU grep historical design discussion:

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

https://git.savannah.gnu.org/cgit/grep.git/tree/src

Most people in tech don't seem to grasp that there are very few compatible/identical regex formats. Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust. I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.

You're speaking to the author of Rust's regex engine.

> Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust.

Can you say more? All of the classes are documented here for the regex crate: https://docs.rs/regex/latest/regex/#syntax

Any not listed there are from Unicode and defined by Unicode.

> I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.

So what should have I done instead?

I generally advise people never to say "never reinvent something," because that stifles innovation and progress.

> Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

Not quite sure what you're trying to say here, but see: https://news.ycombinator.com/item?id=33567129