| Yeah the "optimize for years" part is interesting... Supposedly the derivatives technique (re-popularized by a 2009 paper) will build a more optimal DFA directly, rather than building the NFA first, converting to DFA, and then optimizing the DFA. I put a bunch of links and quotes about that here, including nascent implementations: http://www.oilshell.org/blog/2020/07/ideas-questions.html Also related: http://www.oilshell.org/blog/2020/07/eggex-theory.html About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?). https://github.com/MichaelPaddon/epsilon https://github.com/MichaelPaddon/epsilon/blob/master/epsilon... If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc but there are a couple reasons why we might have our own (globstar and extended globs) Trivia: extended globs in bash give globs the power of regular expressions, e.g. [[ abcXabcXXabcabc == +(abc|X) ]] ; echo $?
0
where +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax, and == is very confusingly the fnmatch() operator. |
The derivatives approach makes Unicode support easier since its able to keep the symbols sets for each transition edge (in the DFA) more compact by virtue of supporting negation. If you add in aggressive term-normalization, hash-consing, and an efficient dense-set implementation (all of which I’ve done in my implementation), the derivatives approach can be extremely fast, even when generating the DFA for something like the lexer of a full programming language (in my case, F#).