

by simias 2005 days ago

I would add "build a toy regex engine" to the list.

A couple of years ago I implemented a toy regex engine from scratch (building NFAs, then turning them into DFAs). I thought it was an enlightening experience because it showed me that the core principles behind regular languages are fairly simple, although you could spend years optimizing and improving your implementation. How do you deal with Unicode? How do you modify your implementation to know how many characters you can skip when there is no match, so that you avoid testing every single position in a file?

It demystified the concept of a regex engine for me while at the same time making me realize how impressive the advanced, ultra optimized engines we use and take for granted are.

5 comments

chubot 2005 days ago

Yeah the "optimize for years" part is interesting... Supposedly the derivatives technique (re-popularized by a 2009 paper) will build a more optimal DFA directly, rather than building the NFA first, converting to DFA, and then optimizing the DFA.

I put a bunch of links and quotes about that here, including nascent implementations:

http://www.oilshell.org/blog/2020/07/ideas-questions.html

About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?).

https://github.com/MichaelPaddon/epsilon

https://github.com/MichaelPaddon/epsilon/blob/master/epsilon...

If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc but there are a couple reasons why we might have our own (globstar and extended globs)

Trivia: extended globs in bash give globs the power of regular expressions, e.g.

    [[ abcXabcXXabcabc == +(abc|X) ]] ; echo $?
    0

where +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax, and == is very confusingly the fnmatch() operator.
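For comparison, here is the same check written against the "normal" regex form in Python. This is only a sketch of the equivalence; `glob_like_match` is a made-up helper name, and the pattern is hand-translated rather than produced by any glob-to-regex converter.

```python
import re

# +(abc|X) in extended-glob syntax corresponds to (abc|X)+ as a regex.
# Glob matching covers the whole string, hence fullmatch rather than search.
pattern = re.compile(r"(abc|X)+")

def glob_like_match(s: str) -> bool:
    """Hypothetical helper: True if s consists of one or more 'abc' or 'X' pieces."""
    return pattern.fullmatch(s) is not None

print(glob_like_match("abcXabcXXabcabc"))  # True
print(glob_like_match("abcXab"))           # False, trailing 'ab' is not a full piece
```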


profquail 2005 days ago

I wrote an implementation of this several years back. If you’re interested in the code: https://github.com/jack-pappas/facio/tree/master/Reggie

The derivatives approach makes Unicode support easier since it's able to keep the symbol sets for each transition edge (in the DFA) more compact by virtue of supporting negation. If you add in aggressive term-normalization, hash-consing, and an efficient dense-set implementation (all of which I’ve done in my implementation), the derivatives approach can be extremely fast, even when generating the DFA for something like the lexer of a full programming language (in my case, F#).
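For readers who have not seen the technique, here is a minimal Python sketch of Brzozowski derivatives, where each distinct derivative term becomes one DFA state. It is not the facio/Reggie code: all class and function names are illustrative, it works on single characters rather than symbol sets (so it shows neither negation nor the Unicode benefit), and its light simplification rules are weaker than the normalization a real implementation needs to keep the set of terms small and finite for arbitrary inputs.

```python
from dataclasses import dataclass

# Regex terms (illustrative names). frozen=True makes them hashable.
@dataclass(frozen=True)
class Empty: pass            # matches nothing
@dataclass(frozen=True)
class Eps: pass              # matches only the empty string
@dataclass(frozen=True)
class Chr:
    c: str
@dataclass(frozen=True)
class Alt:
    a: object
    b: object
@dataclass(frozen=True)
class Cat:
    a: object
    b: object
@dataclass(frozen=True)
class Star:
    a: object

def nullable(r):
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    if isinstance(r, Cat):
        return nullable(r.a) and nullable(r.b)
    return False  # Empty, Chr

def alt(a, b):
    """Smart constructor: drop Empty branches and exact duplicates."""
    if isinstance(a, Empty): return b
    if isinstance(b, Empty): return a
    return a if a == b else Alt(a, b)

def cat(a, b):
    """Smart constructor: Empty annihilates, Eps is the identity."""
    if isinstance(a, Empty) or isinstance(b, Empty): return Empty()
    if isinstance(a, Eps): return b
    if isinstance(b, Eps): return a
    return Cat(a, b)

def deriv(r, ch):
    """The regex matching whatever r can match after consuming ch."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == ch else Empty()
    if isinstance(r, Alt):
        return alt(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, Cat):
        first = cat(deriv(r.a, ch), r.b)
        return alt(first, deriv(r.b, ch)) if nullable(r.a) else first
    if isinstance(r, Star):
        return cat(deriv(r.a, ch), r)
    raise TypeError(r)

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

def build_dfa(r, alphabet):
    """Explore derivatives; each distinct term is one DFA state."""
    states, todo, trans = {r: 0}, [r], {}
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            nxt = deriv(cur, ch)
            if nxt not in states:
                states[nxt] = len(states)
                todo.append(nxt)
            trans[(states[cur], ch)] = states[nxt]
    accepting = {i for term, i in states.items() if nullable(term)}
    return trans, accepting

# (abc|X)+ written as A A*, with A = abc|X
A = Alt(Cat(Chr("a"), Cat(Chr("b"), Chr("c"))), Chr("X"))
plus = Cat(A, Star(A))
print(matches(plus, "abcXabcXXabcabc"))  # True
```

Note that no NFA is ever built: `build_dfa` goes straight from terms to a transition table, which is the point made upthread.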


chubot 2005 days ago

Very cool! And thanks for the reminder about Unicode. I think supporting intersection and complement (negation) is also somewhat unique to the derivatives method, and also related? (Although I think there are really 2 derivatives methods: Brzozowski and Antimirov)

What happens if you don't do the optimizations? Does the DFA blow up in size, meaning the compile time is large? Or does it make for a slower runtime? I would expect most DFAs to run at about the same speed, unless they are really huge...

I'd be interested in any rough ideas about performance, e.g. how fast a realistic lexer+parser is, maybe in lines/ms.

It does look like the code is pretty short -- a large part of it is an AVL tree library I guess for hash consing?

I'm interested in any downsides of the derivatives technique vs. the NFA->DFA method. I feel like regex compile time shouldn't matter for many applications, and most DFAs will run at the same speed, which only leaves runtime memory usage (or code size for generating F# code like you appear to be doing).


profquail 2005 days ago

The optimizations get you the following:

* Normalization: this is where "smart constructors" come in handy; having a normal form for the terms allows the caching to work better. This also impacts the compactness of the generated DFA.

* Hash-consing: this turns structural equality (in this case) into a simple pointer equality; applied recursively, this makes it much faster to compare two terms for equality, and overall speeds up the DFA generation by a non-trivial amount (I forget the exact numbers, but it was significant).

* Dense set implementation: the AVL tree-based data structure in the facio/Reggie code is an implementation of the Discrete Interval Encoding Tree (DIET) data structure from the "Diets for fat sets" and "More on Balanced Diets" papers.
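A rough Python illustration of how the first two points fit together. This is a sketch under assumptions, not the facio/Reggie code: `Term`, `mk`, and the constructor names are invented here, and ordering alternation operands by `id()` is just one convenient way to get a canonical form within a single run.

```python
class Term:
    """Regex term. Only created through mk(), so equal terms are the same object."""
    __slots__ = ("kind", "args")
    def __init__(self, kind, args):
        self.kind, self.args = kind, args

_table = {}  # hash-consing table: structural key -> the unique Term

def mk(kind, *args):
    # Children are already unique (and kept alive by the table), so their
    # id() is a stable stand-in for structural identity.
    key = (kind, tuple(id(a) if isinstance(a, Term) else a for a in args))
    t = _table.get(key)
    if t is None:
        t = _table[key] = Term(kind, args)
    return t

EMPTY = mk("empty")  # matches nothing
EPS = mk("eps")      # matches only the empty string

def char(c):
    return mk("chr", c)

# Smart constructors: normalize before building, so that terms denoting the
# same language are more likely to end up as the same object.
def alt(a, b):
    if a is EMPTY: return b
    if b is EMPTY: return a
    if a is b: return a                 # idempotence: r|r = r
    if id(a) > id(b): a, b = b, a       # commutativity: fix an operand order
    return mk("alt", a, b)

def cat(a, b):
    if a is EMPTY or b is EMPTY: return EMPTY
    if a is EPS: return b
    if b is EPS: return a
    return mk("cat", a, b)

def star(a):
    if a is EMPTY or a is EPS: return EPS
    if a.kind == "star": return a       # (r*)* = r*
    return mk("star", a)

# Structural equality is now a pointer comparison:
print(alt(char("a"), char("b")) is alt(char("b"), char("a")))  # True
print(star(star(char("a"))) is star(char("a")))                # True
```

Because every comparison is an `is` check, looking up "have I seen this derivative before?" during DFA generation no longer needs a recursive walk over both terms.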

Note the optimizations I've mentioned here impact the performance of generating the DFA. Once you have the DFA, it'll run at the same speed as one generated in any other way. Part of the motivation for my writing this library was to learn about regex/DFAs/grammars, but also to try to improve on the performance of fslex/fsyacc at the time. Using this library, the FSharpLex tool can generate the DFA for the full F# language grammar in well under 1 sec; the code generation takes a bit longer, largely due to having to convert the DFA into a different form for backwards-compatibility with fslex.

Overall, I feel like the derivatives technique is generally better and simpler, and I'm not aware of any real downsides. The only one that comes to mind is if you're wanting to implement things like backreferences and capture groups -- those obviously make the implementation (of the DFA) more complicated, and there's a lot less literature on it (last I saw, maybe only one or two papers on implementing those features on top of a derivatives-based regex engine).


chubot 2004 days ago

Great info, thanks! I think derivatives could be very well suited for a compact implementation of POSIX regular expressions. You need to handle Unicode classes but not backreferences!

Although it does seem more suited for functional languages for sure, whereas I basically only have a C runtime.

Capturing might be an issue. I found this

https://www.home.hs-karlsruhe.de/~suma0002/publications/posi...

but I think it's actually being a bit pedantic, i.e. if "almost all POSIX implementations are buggy" then applications don't rely on those exact semantics (they probably rely on the buggy ones, if anything ...)

Maybe more relevant: http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp1...


schoen 2005 days ago

I really enjoyed taking Ullman's Automata course on Coursera. I found it was great for better appreciating topics like

* searching

* implementation of automata in electronic circuits

* challenges of formal specifications for things like protocols and grammars, as well as for verifying their correctness; implementation strategies for applying these specifications

* computability and complexity

* programming language theory

* history of computer science

* LANGSEC arguments

in addition to having an austere mathematical beauty.


psibi 2005 days ago

Somehow I found the course very dry (compared to other Coursera courses like Dan Grossman's PL course).


schoen 2004 days ago

I agree that it was quite dry, but the intellectual content was great.


theshank 2005 days ago

Thank you for sharing this one. I was looking for a course like this! BTW this course is now offered on edX.


schoen 2005 days ago

Oh, thanks for the update! It looks like the new version is at

https://www.edx.org/course/automata-theory

Definitely recommended if you like somewhat dry and mathematical stuff with deep relevance to many areas of computer science. :-)


KMag 2005 days ago

Have a look at RE2 and Russ Cox's paper [0]. I found the use of a small LRU cache to lazily convert portions of the NFA to a DFA particularly elegant. A fast regex engine is pretty easy to implement, as long as you don't need extensions, particularly backreferences.

[0] https://swtch.com/~rsc/regexp/regexp1.html
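A toy Python sketch of the lazy-conversion idea as described above. It is not RE2's code or its actual caching policy: `LazyDFA`, the dictionary NFA encoding, and the default cache size are all assumptions made for illustration, and the NFA is written out by hand instead of being compiled from a pattern.

```python
from collections import OrderedDict

class LazyDFA:
    """Simulates an epsilon-free NFA, memoizing state-set transitions on demand."""

    def __init__(self, nfa, start, accept, cache_size=64):
        self.nfa = nfa                    # {(state, char): set of next states}
        self.start = frozenset([start])
        self.accept = accept              # set of accepting NFA states
        self.cache = OrderedDict()        # (state set, char) -> state set, LRU order
        self.cache_size = cache_size
        self.misses = 0                   # how many DFA transitions were computed

    def step(self, states, ch):
        key = (states, ch)
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        # Cache miss: compute this one DFA transition from the NFA.
        self.misses += 1
        nxt = frozenset(t for s in states for t in self.nfa.get((s, ch), ()))
        self.cache[key] = nxt
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict the least recently used entry
        return nxt

    def fullmatch(self, text):
        states = self.start
        for ch in text:
            states = self.step(states, ch)
            if not states:                    # no live NFA states left
                return False
        return bool(states & self.accept)

# Hand-built NFA for (a|b)*abb
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
m = LazyDFA(nfa, start=0, accept={3})
print(m.fullmatch("ababb"))  # True
print(m.misses)              # 4, the repeated ({0,1}, 'b') step was a cache hit
```

Only the DFA states that the input actually visits are ever materialized, and the cache bound keeps memory fixed even when the full DFA would be huge.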


ignoramous 2005 days ago

See also this Elevator game: https://play.elevatorsaga.com/


abhgh 2005 days ago

I agree with this ... I coded a Thompson NFA [1] out of interest a few years ago; definitely recommended as an exercise.

[1] https://swtch.com/~rsc/regexp/regexp1.html
