by simias
2005 days ago
I would add "build a toy regex engine" to the list. A couple of years ago I implemented one from scratch (building NFAs, then turning them into DFAs). It was an enlightening experience because it showed me that the core principles behind regular languages are fairly simple, although you could spend years optimizing and improving your implementation. How do you deal with Unicode? How do you modify your implementation to know how many characters you can skip when there is no match, so that you avoid testing every single position in a file? It demystified the concept of a regex engine for me while also making me realize how impressive the advanced, ultra-optimized engines we use and take for granted are.
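For anyone curious what the "NFA, then DFA" step looks like, here is a minimal sketch of the subset construction. It is not the code from the project described above: the NFA for (a|b)*abb is written out by hand rather than parsed from a pattern, and the names `eps_closure`, `nfa_to_dfa` and `dfa_match` are made up for illustration.

```python
from collections import deque

EPS = None  # label used for epsilon (empty) transitions


def eps_closure(nfa, states):
    """Return every NFA state reachable from `states` using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = eps_closure(nfa, {start})
    table, seen, queue = {}, {d_start}, deque([d_start])
    while queue:
        cur = queue.popleft()
        for ch in alphabet:
            moved = set()
            for s in cur:
                moved.update(nfa.get((s, ch), ()))
            if not moved:
                continue  # no transition: the DFA simply rejects here
            nxt = eps_closure(nfa, moved)
            table[(cur, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    d_accepting = {s for s in seen if s & accepting}
    return d_start, table, d_accepting


def dfa_match(dfa, text):
    """Run the DFA over the whole string; one table lookup per character."""
    state, table, accepting = dfa
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hand-built NFA for (a|b)*abb. Keys are (state, symbol), values are target sets.
NFA = {
    (0, EPS): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
    (3, "b"): {4},
}
DFA = nfa_to_dfa(NFA, start=0, accepting={4}, alphabet="ab")

print(dfa_match(DFA, "aababb"))
print(dfa_match(DFA, "abba"))
```

The core really is this small. Parsing the pattern, handling large alphabets and skipping ahead in the input are where the years of optimization go, and none of that is attempted here.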
I put a bunch of links and quotes about that here, including nascent implementations:
http://www.oilshell.org/blog/2020/07/ideas-questions.html
Also related: http://www.oilshell.org/blog/2020/07/eggex-theory.html
About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?).
https://github.com/MichaelPaddon/epsilon
https://github.com/MichaelPaddon/epsilon/blob/master/epsilon...
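For context, here is a minimal sketch of matching with Brzozowski derivatives. It is not taken from the epsilon project; the class names and the `nullable`, `deriv` and `matches` helpers are made up for illustration. One plausible reason derivatives pair well with Unicode (an assumption on my part) is that a character class can stay a code point range instead of being expanded into one transition per character:

```python
from dataclasses import dataclass


class Re:
    """Base class for regex AST nodes."""


@dataclass(frozen=True)
class Empty(Re):      # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Re):        # matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Re):        # matches one code point in the range lo..hi
    lo: str
    hi: str


@dataclass(frozen=True)
class Cat(Re):
    a: Re
    b: Re


@dataclass(frozen=True)
class Alt(Re):
    a: Re
    b: Re


@dataclass(frozen=True)
class Star(Re):
    a: Re


def nullable(r):
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, Cat):
        return nullable(r.a) and nullable(r.b)
    return nullable(r.a) or nullable(r.b)  # Alt


def deriv(r, c):
    """Derivative of r with respect to c: what is left to match after reading c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.lo <= c <= r.hi else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), r)
    # Cat: c is consumed by r.a, or by r.b when r.a can match the empty string
    left = Cat(deriv(r.a, c), r.b)
    return Alt(left, deriv(r.b, c)) if nullable(r.a) else left


def matches(r, text):
    """Take one derivative per code point, then check for nullability."""
    for c in text:
        r = deriv(r, c)
    return nullable(r)


# [α-ω]*! : any run of lowercase Greek letters followed by "!"
PATTERN = Cat(Star(Chr("α", "ω")), Chr("!", "!"))

print(matches(PATTERN, "αβγ!"))
print(matches(PATTERN, "abc!"))
```

This version never simplifies the intermediate expressions, so they grow with the input. A real derivative-based engine would normalize them and cache the results as DFA states.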
If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc, but there are a couple of reasons why we might want our own (globstar and extended globs).
Trivia: extended globs in bash give globs the power of regular expressions, e.g. the pattern +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax. Very confusingly, == inside a bash [[ ]] test is the fnmatch() operator.
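To make the equivalence concrete, here is a rough sketch that rewrites a subset of extended-glob syntax into a Python regex. The function names `extglob_to_regex` and `extglob_match` are hypothetical, and this is not how bash or libc implements matching. It assumes no bracket expressions, no backslash escapes and no !(...) negation:

```python
import re

# Suffix to emit when the group opened by each extglob operator closes.
_SUFFIX = {"?": ")?", "*": ")*", "+": ")+", "@": ")"}


def extglob_to_regex(pat):
    """Translate a subset of bash extended-glob syntax into regex source."""
    out, stack, i = [], [], 0
    while i < len(pat):
        c = pat[i]
        nxt = pat[i + 1] if i + 1 < len(pat) else ""
        if c in _SUFFIX and nxt == "(":
            out.append("(?:")          # the operator moves from prefix to suffix
            stack.append(_SUFFIX[c])
            i += 2
            continue
        if c == "!" and nxt == "(":
            raise NotImplementedError("!(...) is not handled in this sketch")
        if c == ")" and stack:
            out.append(stack.pop())
        elif c == "|" and stack:
            out.append("|")            # alternation only has meaning inside a group
        elif c == "*":
            out.append(".*")
        elif c == "?":
            out.append(".")
        else:
            out.append(re.escape(c))
        i += 1
    if stack:
        raise ValueError("unbalanced parenthesis in pattern")
    return "".join(out)


def extglob_match(pat, s):
    """Whole-string match, like fnmatch() rather than a regex search."""
    return re.fullmatch(extglob_to_regex(pat), s, re.DOTALL) is not None


print(extglob_to_regex("+(abc|X)"))
print(extglob_match("+(abc|X)", "abcXabc"))
print(extglob_match("*.@(py|sh)", "build.sh"))
```

Negation with !(...) is the awkward case, since it has no simple regex suffix equivalent, which is why it is left out here.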