| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by glangdale 2691 days ago

> 2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patters/expressions are indexed by state machine and filtered effectively during matching.

This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around in in 2013 and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.

All multi-pattern systems get a huge improvement over one-at-a-time regex. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns and the gap should get progressively bigger.

I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.

Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.

2 comments

ychetyrko 2690 days ago

Geoff, for me it's real honor that you, well-known creator of Hyperscan, paid attention to what we are doing. We truly respect all the prior work in this field (Hyperscan, RE2, Rust, etc), so different benchmarks will be available.

However, we all need to keep in mind that Nevod operates on words, not characters, and Nevod targets large rule decks (1000-100000 patterns). So, we don't compete in the field of individual pattern matching or streaming matching (to some extent, goals of Hyperscan and Nevod are different). In other words, a lot of things depend on target goals, which usually are reflected in the kind of patterns and in their number.

Regarding patent. Sorry, for legal reasons I can't expose details at the moment.

Thank you for your valuable comments.

link

glangdale 2690 days ago

I don't think I'm all that well-known. :-)

Staying away from streaming is a good idea for a lot of designs. We all lost a lot of our hair making streaming work in Hyperscan, and it was often the thing suppressing easier optimizations. Or at least we would have to have a separate approach for streaming and block mode; not exactly great s/w engineering.

Word-based approaches are interesting. So is large scale. Hyperscan does embarrassingly badly on natural language stuff at scale due to some baked in assumptions about relative frequencies of literal factors and the literal matchers tend to fall apart on really huge literal sets in this domain. So there's definitely an opportunity here.

I would strongly encourage you to produce some repeatable benchmarks (at least, things where we can see what the task is for something like RE2 or Hyperscan). We spent many years in stealth mode with a lot of superlatives and NDA-only Powerpoint decks and I don't think it did ourselves any favors. It means that you wind up being the only guardian of your own honesty, which is ... tricky ("What the hell is a medium-complexity pattern, anyhow?").

link

ychetyrko 2688 days ago

Based on what you are saying I'm concluding that we are on the right path. Again, thank a lot for all the valuable feedback and advice!

link

nostrademons 2691 days ago

Isn't this basically what regexp-based lexer generators (eg. lex, 1975) do? Stdlib regexps need to run multiple passes because the library itself has no idea whether it's being used with one pattern or multiple patterns, but any lexer generator that has all the patterns at once (i.e. all of them) can merge the DFAs generated and match the input in a single pass.

Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.

link

glangdale 2690 days ago

The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this.

This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously.

Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies.

For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.

link