| HN Mirror


by glangdale 2645 days ago

I suspect Thompson's NFA is not inherently dog slow (Glushkov can be done reasonably fast for decent-sized NFAs). The fact is that most Thompson-lineage engines opted for the 'lazy DFA' approach and optimized that (which is effective until it isn't). I imagine a more aggressive 'native' Thompson NFA is possible. A nice benefit of that is not having to write to your bytecode - there's a good deal of systems-level complexity in RE2 that is a consequence of the 'lazy DFA construction' decision.

That being said, matching literals is always going to be faster, especially if you decompose the pattern to get more use out of your literal matcher - the downside of filtration is that if the literal is always present, you are just doing strictly more work. At least with decomposition you've taken the literal out of the picture. See https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-mul... for those who don't know what I'm talking about (I know you've read it).

Am flirting with doing another regex engine that gets some of the benefit of decomposition and literal matching without taking on the nosebleed complexity of Hyperscan...


burntsushi 2645 days ago

Do you know of any fast Thompson NFA simulation implementation? I don't think I've seen one outside of a JIT.

Is there a fast Glushkov implementation that isn't bit parallel? I've never been able to figure out how to use bit parallel approaches with large Unicode classes. Just using a single Unicode aware \w puts it into the weeds pretty quickly. That's where the lazy DFA shines, because it doesn't need to build the full DFA for \w (which is quite large, even after the standard DFA compression tricks).
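For anyone following along, here is a minimal sketch of what "bit parallel" means over bytes. It is restricted to a plain sequence of byte classes, where the Glushkov automaton is a simple chain and the step collapses to the classic Shift-And update. The names `compile_masks`, `search`, and `PATTERN` are made up for illustration and are not taken from any engine discussed here.

```python
def compile_masks(classes):
    """classes: one set of byte values per pattern position.
    Returns a 256-entry table; bit i of masks[b] is set if byte b
    is accepted at position i."""
    masks = [0] * 256
    for i, cls in enumerate(classes):
        for b in cls:
            masks[b] |= 1 << i
    return masks


def search(classes, data):
    """Return the end offset of the first match in data, or -1."""
    masks = compile_masks(classes)
    accept = 1 << (len(classes) - 1)
    state = 0  # bit i set means positions 0..i have just matched
    for pos, b in enumerate(data):
        # Advance every live position by one, restart at position 0,
        # then keep only the positions that accept this byte.
        state = ((state << 1) | 1) & masks[b]
        if state & accept:
            return pos + 1
    return -1


# Hypothetical pattern: [a-c][0-9][0-9]
DIGITS = set(b"0123456789")
PATTERN = [set(b"abc"), DIGITS, DIGITS]
```

The whole NFA state lives in one integer, which is why this is fast when the table has 256 entries. A Unicode-aware \w does not fit in one byte class: in UTF-8 it spans many multi-byte sequences, so a single pattern position turns into many states, which is the blow-up described above.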


glangdale 2645 days ago

Unicode is a PITA. In Hyperscan, it's not pretty what gets generated for a bare \w in UCP mode if you force it into an NFA (it's rather more tractable as a DFA, even if you aren't lazily generating, although of course betting the farm that you can always 'busily' generate a DFA isn't great).

I've always thought that a better job of doing NFAs (Glushkov or otherwise) while staying bit-parallel could be done by having character reachability on codepoints, not bytes, generally remapping down to 'which codepoints make an actual difference'. This sounds ugly/terrifying, but the nice thing is that remapping a long stream of codepoints could be done in parallel (as it's not hard to find boundaries) and with SIMD. Step-by-step NFA or DFA work is more ponderous as every state depends on previous states.
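A rough sketch of the "remap down to the codepoints that make a difference" idea, in scalar Python rather than SIMD. The assumption here is that two codepoints are interchangeable if they belong to exactly the same set of character classes in the pattern. `build_remap`, `remap`, and the toy `WORD` and `DIGIT` classes are hypothetical and only for illustration; `WORD` is nowhere near the real Unicode \w.

```python
from bisect import bisect_right

MAX_CP = 0x10FFFF


def build_remap(classes):
    """classes: list of character classes, each a list of inclusive
    (lo, hi) codepoint ranges.
    Returns (starts, ids, n): sorted interval starts, the equivalence
    class id of each interval, and the number of distinct ids."""
    # Class membership can only change at a range edge.
    cuts = {0}
    for ranges in classes:
        for lo, hi in ranges:
            cuts.add(lo)
            if hi < MAX_CP:
                cuts.add(hi + 1)
    starts = sorted(cuts)

    sig_to_id = {}
    ids = []
    for s in starts:
        # Membership is constant inside an interval, so testing its
        # first codepoint is enough.
        sig = tuple(any(lo <= s <= hi for lo, hi in ranges)
                    for ranges in classes)
        ids.append(sig_to_id.setdefault(sig, len(sig_to_id)))
    return starts, ids, len(sig_to_id)


def remap(text, starts, ids):
    """Map each codepoint to its small class id. Each lookup is
    independent of the others, so this pass could run in parallel."""
    return [ids[bisect_right(starts, ord(ch)) - 1] for ch in text]


# Toy classes (assumptions for the example only).
WORD = [(0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xE9, 0xE9)]
DIGIT = [(0x30, 0x39)]
STARTS, IDS, N_CLASSES = build_remap([WORD, DIGIT])
```

With these two classes the whole codepoint space collapses to three symbols (neither, word and digit, word only), so the automaton's alphabet is tiny no matter how large the underlying classes are.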


burntsushi 2644 days ago

Yeah, I've looked at Glushkov based primarily on your comments about it, but Unicode is always where I get stuck. In my regex engine, Unicode is enabled by default and \w is fairly common, so it needs to be handled well.

And of course, one doesn't need to bet the farm on a lazy DFA if you have one, although it is quite robust in a large number of practical scenarios. (I think RE2 does bet the farm, to be fair.)


glangdale 2644 days ago

Unicode + UCP is a perfectly principled thing, but it wasn't a design point that made any sense for Hyperscan as a default. The bulk of our customers were not interested in turning 1 state for ASCII \w into 600 states for UCP \w unless it was free.

I think both Glushkov and Thompson can be done fast, but I agree that they are both going to be Really Big for UCP stuff. Idle discussions among the ex-Hyperscan folks generally lean towards 'NFA over codepoints' being the right way of doing things.

Occam's razor suggests if you do only 1 thing in a regex system (i.e. designing for simplicity/elegance, which would be an interesting change after Hyperscan) it must be NFA, as not all patterns determinize. If you are OK with a lazy DFA system that can be made to create a new state per byte of input (in the worst case) then I guess you can do that too.
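To make the lazy DFA trade-off concrete, here is a minimal sketch under simplifying assumptions: an epsilon-free NFA given as a transition dict, DFA states represented as frozensets of NFA states, and an unbounded cache. Real engines bound the cache and throw states away when it fills up, which is omitted here. `LazyDFA` and `DELTA` are hypothetical names, not RE2's or anyone else's API.

```python
class LazyDFA:
    """Builds DFA states on demand, one (state, symbol) edge at a time."""

    def __init__(self, delta, start, accepting):
        self.delta = delta                    # nfa_state -> {symbol: set of nfa_states}
        self.start = frozenset([start])
        self.accepting = frozenset(accepting)
        self.cache = {}                       # (dfa_state, symbol) -> dfa_state
        self.states = {self.start}            # DFA states materialized so far

    def step(self, state, ch):
        key = (state, ch)
        nxt = self.cache.get(key)
        if nxt is None:
            # Cache miss: do one step of subset construction.
            nxt = frozenset(t for s in state
                            for t in self.delta.get(s, {}).get(ch, ()))
            self.cache[key] = nxt
            self.states.add(nxt)
        return nxt

    def matches(self, text):
        """True if the NFA is in an accepting state after consuming text."""
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return bool(state & self.accepting)


# NFA for "(a|b)*a(a|b)(a|b)": the third symbol from the end is 'a'.
DELTA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"a": {2}, "b": {2}},
    2: {"a": {3}, "b": {3}},
}
```

Only the DFA states that the input actually visits get built: matching "baab" materializes 4 of the 8 states the full DFA for this NFA would have. The flip side is the worst case mentioned above, where an adversarial input keeps hitting cache misses and forces a new state on nearly every symbol.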

I am not sure how to solve the problem of "NFA over codepoints", btw. Having no more than 256 distinct characters was easy, but even with remapping, the prospect of having to handle arbitrary Unicode is... unnerving.


burntsushi 2643 days ago

Yeah, my Thompson NFA uses codepoints for those reasons. But not in a particularly smart way; mostly just to reduce space usage. It is indeed an annoying problem to deal with!


glangdale 2645 days ago

... and no, I don't know of any fast Thompson NFA simulations, but I don't see why they shouldn't be possible. They have a very simple "next" function, modulo the awfulness of getting past epsilons, but that seems to be roughly parallel to the awfulness of computing arbitrary 'next' functions in Glushkov-land. I'm not aware of anyone that's actually tried.
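For reference, the simple "next" function and the epsilon-chasing around it look roughly like this in a textbook-style Thompson simulation. This is a plain, unoptimized sketch, not a fast implementation; the instruction format (`char`, `split`, `jmp`, `match`) and the hand-assembled `PROG` are assumptions for the example.

```python
def add_state(prog, pc, out, seen):
    """Follow epsilon edges (split/jmp) from pc and append every
    non-epsilon instruction reached to out, each at most once."""
    stack = [pc]
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        op = prog[pc]
        if op[0] == "split":
            stack.extend(op[1:])
        elif op[0] == "jmp":
            stack.append(op[1])
        else:
            out.append(pc)


def step(prog, current, ch):
    """The 'next' function: advance every live thread over ch."""
    nxt, seen = [], set()
    for pc in current:
        op = prog[pc]
        if op[0] == "char" and op[1] == ch:
            add_state(prog, op[2], nxt, seen)
    return nxt


def full_match(prog, text):
    current = []
    add_state(prog, 0, current, set())
    for ch in text:
        current = step(prog, current, ch)
        if not current:
            return False
    return any(prog[pc][0] == "match" for pc in current)


# Hand-assembled program for a(b|c)*d
PROG = [
    ("char", "a", 1),
    ("split", 2, 6),
    ("split", 3, 4),
    ("char", "b", 5),
    ("char", "c", 5),
    ("jmp", 1),
    ("char", "d", 7),
    ("match",),
]
```

The per-character work in `step` is trivial; nearly all of the cost sits in `add_state`, which is the "getting past epsilons" part. Glushkov automata have no epsilon edges, so that cost moves into computing the successor sets instead.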
