Hacker News new | ask | show | jobs
by jjice 761 days ago
This and the associated series of regex posts from Russ Cox are probably the best content on practical regular expression engine internals you can get. If you have a cursory understanding of NFAs and DFAs, it's some of the best technical writing I've ever had the pleasure to read.

These posts are responsible for my love of regular expressions :)

3 comments

Many years ago, as a fresh out of college young coder, I had to come up with a way to check types on an text input file in C++ -- long before C++ had even a standard String type let alone regex libraries like pcre. The standard conversion functions at the time expected you to have done some type detection beforehand.

Each field in the file could contain a string, an int, or a float. In particular to figure out the floats, I remember sitting down for a few hours and coming up with a regular expression, testing it in Perl on a bunch of test cases, and then went about the hard work of slowly converting the regex into an NFA to a DFA and then finally to an adjacency graph in a 2d matrix. A little bit of code to read in things a byte at a time and some simple arithmetic and I had a stupid fast

   bool isFloat(char *)
Nobody at my work understood the code, they just sort of knew that if they submitted a character array with a field value into it, something happened with a big matrix, and it you let them know if it was a float or not and then they could call the appropriate conversion function -- I think stof() or whatever was in fashion back then.

I had to make one revision to it for an edge case I missed and then it found its way into pretty much all code for the company after that.

The tuition for the class that taught me how to do that payed for itself with that one function.

And if you don't have a cursory understanding of NFAs and DFAs, they're still great for getting that cursory understanding.

These articles were my introduction to finite automata (and almost the only place I've learned about them aside from practical use) and my intuition for them seem to often be better than my peers'.

NFAs are apparently undecidable since the same input can go to multiple states. DFA states can only go to one state when given a particular input sequence.

The Dragon book contains an NFA to DFA algorithm (powerset) which explodes states with multiple exiting token transitions.

PS: All your DAGs are belong to us.

It is possible to implement NFAs directly, in parallel:

https://github.com/mike-french/myrex

Just fan-out messages to all downstream states. Let a fair runtime system evaluate all progress, then tally results along all paths, for example:

https://github.com/mike-french/myrex?tab=readme-ov-file#exec...

The approach can also be adapted to captures and their many, many possible ambiguities. There is an example that generalizes a highly ambiguous match given in Cox: regex (a?){m}(a*){m} against a string of a{m}. The case m=9 gives 864k possible solutions (traversals), you may optionally ask for all of them to be returned:

https://github.com/mike-french/myrex?tab=readme-ov-file#ambi...

The processes share nothing, all state is in the messages, so a single network can process multiple input strings simultaneously. It can also generate Monte-Carlo strings that match the regex, and those may also proceed simultaneously.

That's identical to NFA->DFA conversion but with dynamic programming.
NFAs are decidable. As you said, a NFA can be converted to a DFA which is decidable. Maybe you mean they are non-deterministic (that's in the name)?
You know what I meant. I used the wrong word.
There exist algorithms that deal with state blow-up. In fact, these algorithms project any given DFA into an equivalent DFA of minimal size - guaranteeing a minimal automaton for the language that is accepted by it. In essence, you build equivalence classes between states with respect to possible paths they can take into other equivalence classes for further input. State machines for LR-Grammars are vastly more complex in comparison to NFA automata, so for most (if not all) applications, a computer that can deal with LR-Grammars will be able to handle a NFA conversion with a subsequent minimization.
Decidability is whether one can statically determine if an algorithm halts for a given input. NFAs halt for all inputs, by construction.
You may also find this very detailed post interesting https://devblogs.microsoft.com/dotnet/regular-expression-imp...

It's an indepth look at where regex in dotnet was prior to and after version 7.