by someplaceguy 147 days ago

> the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression

The authors seem to claim linear complexity:

> the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks

2 comments

ieviev 147 days ago

We address this in the paper as well.

The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.

It is true that we cannot avoid enormous DFA sizes. A simple case is (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)..., which has 2^4 states, and every additional intersection adds 1 to the exponent.
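
To make the 2^n growth concrete, here is a minimal Python sketch (not RE#'s code) that counts the reachable DFA states for an intersection of "contains letter c" patterns. It assumes a state can be modeled as the set of required letters seen so far; `count_dfa_states` is a hypothetical helper name.

```python
from collections import deque

def count_dfa_states(required: str, alphabet: str) -> int:
    """Count reachable DFA states for the intersection of (.*c.*) over c in required.

    Assumption: a state is fully described by the set of required letters
    already seen, which holds for this particular family of patterns.
    """
    start = frozenset()
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for ch in alphabet:
            # Reading a required letter adds it to the set; any other letter is a self-loop.
            nxt = state | {ch} if ch in required else state
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for n in range(1, 6):
    letters = "abcde"[:n]
    print(n, count_dfa_states(letters, letters + "x"))
```

Each extra intersected pattern doubles the count, which matches the "+1 to the exponent" remark above.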

In practice we get around this by creating at most one state per input character, so even if the full DFA has 1 million states, you need an input at least 1 million characters long to construct all of them.

The real complexity question is how expensive taking a lazy derivative can get. The first time you use the engine with a new input and new states, it is not linear: the worst case is creating a new state for each character. The second time the same (or similar) input is used, these states already exist and matching is linear. So, as said in the article, it is a bit foggy: lazy DFAs are not linear, but they appear linear in practical cases.
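
A toy illustration of that behaviour, as a sketch only: the class below builds transitions on demand and keeps them between calls. It reuses the "set of letters seen" model for (.*a.*)&(.*b.*)&... rather than real derivatives, and `LazyDFA` and its fields are hypothetical names, not part of RE#.

```python
class LazyDFA:
    """Toy lazy DFA for the intersection of (.*c.*) over the required letters."""

    def __init__(self, required: str):
        self.required = frozenset(required)
        self.transitions = {}            # (state, char) -> state, kept between inputs
        self.states = {frozenset()}      # states materialized so far

    def _step(self, state, ch):
        key = (state, ch)
        nxt = self.transitions.get(key)
        if nxt is None:
            # Slow path: compute the transition once and remember it.
            nxt = state | {ch} if ch in self.required else state
            self.transitions[key] = nxt
            self.states.add(nxt)
        return nxt

    def is_match(self, text: str) -> bool:
        state = frozenset()
        for ch in text:
            state = self._step(state, ch)  # adds at most one new state per character
        return state == self.required

dfa = LazyDFA("abcd")
text = "xxabxxcd"
print(dfa.is_match(text), len(dfa.states), len(dfa.transitions))
before = len(dfa.transitions)
dfa.is_match(text)                         # second pass hits only cached transitions
print(len(dfa.transitions) == before)
```

The first pass pays for state construction; the second pass over the same input is pure table lookups.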

btown 147 days ago

> The second time the same (or similar) input is used these states are already created and it is linear.

Does this imply that the DFA for a regex, as an internal cache, is mutable and persisted between inputs? Could this lead to subtle denial-of-service attacks, where inputs are chosen by an attacker to steadily increase the cached complexity - are there eviction techniques to guard against this? And how might this work in a multi-threaded environment?

ieviev 147 days ago

Yes, most (I think all) lazy DFA engines internally keep a mutable DFA behind a lock, and it grows during matching.

Multithreading is generally a non-issue: you just wrap the function that creates a state in a lock/mutex, which is usually the default.

The subtle denial-of-service part is interesting; I hadn't thought of it before. Yes, this is possible. For security-critical uses I would compile the full DFA ahead of time. The memory cost may be painful, but this completely removes the chance of anything going wrong.

There are valid arguments for switching from a DFA to an NFA when the state space is large, but RE# intentionally does not switch to an NFA and instead focuses on reducing the DFA's memory cost (e.g. minterm compression in the post, algebraic simplifications in the paper).

The problem with going from DFA to NFA for large state spaces is that match-time performance falls off a cliff, something like going from 1GB/s to 1KB/s, as we also show in the benchmarks in the paper.

As for eviction techniques, I have not researched this. The simplest thing to do is to reset the instance completely and rebuild once it grows past a certain size, but there is likely a better way.
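
The "reset past a certain size" idea can be sketched as follows. This is a generic illustration, not the eviction logic of any real engine; `ResettingCache`, `compute`, and `max_entries` are hypothetical names, and the toy `compute` function simply produces a fresh state on every step so the cache is forced to overflow.

```python
class ResettingCache:
    """Transition cache that is dropped wholesale once it reaches max_entries."""

    def __init__(self, compute, max_entries: int):
        self.compute = compute          # (state, char) -> next state
        self.max_entries = max_entries
        self.table = {}
        self.resets = 0

    def get(self, state, ch):
        key = (state, ch)
        if key in self.table:
            return self.table[key]
        if len(self.table) >= self.max_entries:
            # Simplest possible eviction: throw everything away and start over.
            self.table.clear()
            self.resets += 1
        value = self.compute(state, ch)
        self.table[key] = value
        return value

# Toy transition function: every step yields a new state, so the cache keeps growing.
cache = ResettingCache(compute=lambda state, ch: state + 1, max_entries=4)
state = 0
for ch in "a" * 10:
    state = cache.get(state, ch)
print(cache.resets, len(cache.table))
```

Memory stays bounded, at the price of recomputing transitions after every reset.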

layer8 147 days ago

> Multithreading is generally a non-issue, you just wrap the function that creates the state behind a lock/mutex, this is usually the default.

But you also have to lock when reading the state, not just when writing/creating it. Wouldn’t that cause lock contention with sufficiently concurrent use?

ieviev 147 days ago

No, we do not lock when reading a state. We lock only the creation side, and the transition table reference stays valid during matching even if it is outdated.

The matcher enters the locked region only when it encounters a state that does not exist yet.
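
A rough Python sketch of this pattern, under stated assumptions: reads go straight to the table without a lock, and only a miss takes the lock, re-checks, and creates the entry. It relies on a single `dict.get` or item assignment in CPython being safe to run concurrently, and on the table only ever growing; other languages need an explicit atomic publication step. `ConcurrentLazyTable` and `run` are hypothetical names, not RE#'s API.

```python
import threading

class ConcurrentLazyTable:
    """Grow-only transition table: unlocked reads, locked creation."""

    def __init__(self, compute):
        self._compute = compute         # (state, char) -> next state
        self._table = {}
        self._lock = threading.Lock()
        self.created = 0

    def next_state(self, state, ch):
        key = (state, ch)
        nxt = self._table.get(key)      # fast path: no lock taken
        if nxt is not None:
            return nxt
        with self._lock:                # slow path: only on a missing transition
            nxt = self._table.get(key)  # another thread may have created it meanwhile
            if nxt is None:
                nxt = self._compute(state, ch)
                self._table[key] = nxt
                self.created += 1
            return nxt

def run(table, text):
    state = 0
    for ch in text:
        state = table.next_state(state, ch)
    return state

# Toy transition function over 7 integer states.
table = ConcurrentLazyTable(lambda s, ch: (s + ord(ch)) % 7)
results = []
threads = [threading.Thread(target=lambda: results.append(run(table, "abcabc" * 50)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(set(results)), table.created)
```

All threads agree on the result, and each transition is created exactly once even though readers never block each other.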

layer8 147 days ago

Ah, I see, so it’s basically the Racy Single-Check Idiom.

ot 142 days ago

> are there eviction techniques to guard against this?

RE2 resets the cache when it reaches a (configurable) size limit. I found this out the hard way when debugging almost-periodic latency spikes in a service I managed: a very inefficient regex caused linear growth in the lazy DFA until it hit the limit, then all threads had to wait a few hundred milliseconds for the reset, and then the cycle started again.

I'm not sure if dropping the whole cache is the only feasible mitigation, or some gradual pruning would also be possible.

Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may not be applicable anymore. RE2 uses RW-locking for this.

ieviev 142 days ago

I have experienced this as well. The performance degradation from DFA to NFA is enormous, and while not as bad as exponential backtracking, it is close to ReDoS territory.

The Rust version of the engine (https://github.com/ieviev/resharp) just returns an Error instead of falling back to an NFA. I think that is a reasonable approach, but the library is still new, so I'm waiting to see how it turns out and whether I overlooked anything.

ot 142 days ago

In this case RE2 does not fall back to the NFA; it just resets the lazy DFA cache and starts growing it again. The latency spikes I mentioned are due to the cost of destroying the cache (deallocations, pointer chasing, ...).

ieviev 141 days ago

Ah, sorry, then I misunderstood the comment.

I'm not sure whether this applies to both RE2 and Rust, but some of Rust's internal engines appear to allocate a fixed buffer that they constantly re-create states into.

I'm not really familiar with RE2's eviction technique, but I've done a lot of benchmark comparisons. A good way to really stress test RE2 is large Unicode classes. \w and \d in RE2 are ASCII-only, and I've noticed that Unicode (\p{class}) classes change the throughput of the engine drastically.

mananaysiempre 147 days ago

These claims are compatible. For instance, lex, re2c, ragel, etc. are exponential in the length of the automaton description, but the resulting lexers work in linear time[1] in the length of the string.

Here the situation is even better, because the DFA is constructed lazily, at most one state per input character, so to observe its full enormity relative to the size of the needle you need an equally enormous haystack. “One state per input character” somewhat understates things, because producing each state requires a non-constant[2] amount of work, and storing the new derivative’s syntax tree a non-constant amount of space; but with hash-consing and memoization it’s not bad.

Either way, derivative-based matching with a lazy DFA takes something like O(needle + f(needle) × haystack) time where I’m guessing f(n) has to be O(n log n) at least (to bring large ORs into normal form by sorting) but in practice is closer to constant. Space consumption is less of an issue because if at any point your cache of derivatives (= DFA) gets too bloated you can flush it and restart from scratch.

[1] Kind of, unless you hit ambiguities that need to be resolved with the maximal munch rule; anyways that’s irrelevant to a single-RE matcher.

[2] In particular, introductions to Brzozowski’s approach usually omit—but his original paper does mention—that you need to do some degree of syntax-tree simplification for the derivatives to stay bounded in size (thus finite in number) and the matcher to stay linear in the haystack.
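
For readers who have not seen the approach, here is a minimal Python sketch of derivative-based matching with intersection and complement. It is a textbook-style illustration, not RE#'s implementation: regexes are plain tuples, and the smart constructors (`seq`, `alt`, `and_`, `star`, `not_`, all hypothetical names) apply just enough simplification (set-based normal forms for OR and AND, identity and annihilator rules) to keep the derivatives of the example patterns bounded. Memoization and hash-consing are left out.

```python
from functools import reduce

EMPTY = ("empty",)   # matches no string
EPS = ("eps",)       # matches only the empty string
ANY = ("any",)       # matches any single character

def char(c):
    return ("chr", c)

def seq(a, b):
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    return a if b == EPS else ("seq", a, b)

def star(a):
    if a in (EMPTY, EPS):
        return EPS
    return a if a[0] == "star" else ("star", a)

def not_(a):
    return a[1] if a[0] == "not" else ("not", a)  # double negation cancels

def _flatten(tag, a, b):
    # Merge nested ORs/ANDs into one set: associativity, commutativity, idempotence.
    parts = set()
    for r in (a, b):
        if r[0] == tag:
            parts |= r[1]
        else:
            parts.add(r)
    return parts

def alt(a, b):
    parts = _flatten("alt", a, b) - {EMPTY}
    if not parts:
        return EMPTY
    return next(iter(parts)) if len(parts) == 1 else ("alt", frozenset(parts))

def and_(a, b):
    parts = _flatten("and", a, b)
    if EMPTY in parts:
        return EMPTY
    return next(iter(parts)) if len(parts) == 1 else ("and", frozenset(parts))

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return any(nullable(p) for p in r[1])
    if tag == "and":
        return all(nullable(p) for p in r[1])
    if tag == "not":
        return not nullable(r[1])
    return False  # empty, any, chr

def deriv(r, c):
    """Brzozowski derivative: the regex matching what may follow c."""
    tag = r[0]
    if tag == "any":
        return EPS
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "seq":
        left = seq(deriv(r[1], c), r[2])
        return alt(left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == "star":
        return seq(deriv(r[1], c), r)
    if tag == "alt":
        return reduce(alt, (deriv(p, c) for p in r[1]))
    if tag == "and":
        return reduce(and_, (deriv(p, c) for p in r[1]))
    if tag == "not":
        return not_(deriv(r[1], c))
    return EMPTY  # empty, eps

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

def contains(c):
    return seq(star(ANY), seq(char(c), star(ANY)))  # .*c.*

pattern = and_(contains("a"), not_(contains("b")))  # (.*a.*)&~(.*b.*)
print(matches(pattern, "xax"), matches(pattern, "ab"))
```

Intersection and complement cost nothing extra here: they are just two more cases in `deriv` and `nullable`, which is the point made earlier in the thread about avoiding determinization.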
