by Banana699 1678 days ago

>teachers, or people who explain/document things for a living

I'm neither of those, but I frequently explain things to my friends and they say I explain well. So I will throw in my two cents anyway and hope you don't find them trivial self-help platitudes.

(1) Start with Concrete things

No learning ever starts from generalities. Never start with something like "Regular Expressions is a declarative language to describe strings of a certain general form blah blah blah". I call this the Wikipedia style of teaching: an utterly useless word-swapping game where you explain things and constructs in terms of even more complicated (or equally complicated) things and constructs, till the learner runs out of stack space and comes away having learned nothing and feeling like a failure on top of that. Remember that learning is a process of building up: you start from the familiar questions, problems, specifics, themes or worldviews of the learner, then gradually introduce generalizations and solutions to get them to where you want them to be.

(This is generally a two-way street: the learner also has to know something about the teacher, where they are coming from and what they are trying to do. It's like telling a story: the author can't simply say "because I say so!" to explain every detail of the plot, but the reader also can't say "I don't know, feels too unbelievable" in response to every plot detail.)

The bare essence of regex is using meta characters to encode several string characters. The fact that the regex

"meta.*"

so powerfully and succinctly encodes string-recognizing logic that would be imperatively expressed as

def metastar(s):
    if len(s) < 4:           # too short to hold "meta"
        return False
    if s[0:4] != "meta":     # first four characters; slice end is exclusive
        return False
    return True

makes the case concretely and perfectly: a single string (two characters longer than the simplest string it matches) versus 3 bug-hiding branches (e.g. what if the "!=" operator in the implementation language actually compares string identity, not string equality?). And this is already more generous than most languages allow: the ':' array slicing operator, for example, is saving us a loop (possibly inefficiently, if it copies the slice out of the string. Not a problem now for "meta", but who knows when it will be?).
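To put the two side by side, here is a minimal Python sketch. It assumes Python's standard re module and reads "meta.*" as a pattern that must match the whole string. The function names are mine, picked for illustration.

```python
import re

def metastar_imperative(s: str) -> bool:
    # Hand-written check: at least 4 characters, and the first 4 are "meta".
    if len(s) < 4:
        return False
    if s[0:4] != "meta":
        return False
    return True

def metastar_regex(s: str) -> bool:
    # The same check stated as a pattern. re.DOTALL lets "." match newlines
    # too, so the two functions agree on every input string.
    return re.fullmatch(r"meta.*", s, flags=re.DOTALL) is not None

# A few hand-picked inputs, some matching and some not.
samples = ["meta", "meta-circular", "met", "", "Meta", "a meta"]
for s in samples:
    print(repr(s), metastar_imperative(s), metastar_regex(s))
```

Without re.DOTALL the regex version would reject strings containing a newline, which is exactly the kind of subtlety worth showing a learner early.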

Regexes are patterns, which are things that resemble the things they are describing, but aren't any of those things specifically. It's like a dark silhouette of a man: it doesn't describe any specific man, it's a pattern that can match any man of the same general body plan and height. Regexes are silhouettes, and the dark parts are the meta characters that act as placeholders for arbitrary strings.

(2) Examples from real life

Don't just take the "menu approach" of reading out all the features and meta characters and thinking you're explaining; actually take the time with examples. Again, examples are all that matters for the human brain: it's literally useless to tell somebody to imagine a golden mountain if they have never seen a mountain or gold before.

Our world is awash in strings with certain identifiable structures (money, dates, times, names in formal settings, equations, etc.). Take the time to obtain several real-life examples, and try to make the data come from sources like Wikipedia or other publicly available datasets. After demonstrating how 3 or 4 of those general forms of strings can be described powerfully by a given meta character, give the learner 3 or 4 more general forms to try on their own.
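As a rough illustration of what such a session could look like, here is a Python sketch using the standard re module. The sample sentence is invented for this example (a real lesson would pull text from a public source), and the three patterns are deliberately simple rather than complete.

```python
import re

# Invented sample text, standing in for a paragraph from a public dataset.
text = "Invoice dated 2021-03-14, due 09:30 next day, total $1250.00 plus $15 shipping."

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")    # ISO-style dates such as 2021-03-14
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")     # clock times such as 09:30
MONEY = re.compile(r"\$\d+(?:\.\d{2})?")    # dollar amounts, cents optional

dates = DATE.findall(text)
times = TIME.findall(text)
amounts = MONEY.findall(text)

print(dates)    # ['2021-03-14']
print(times)    # ['09:30']
print(amounts)  # ['$1250.00', '$15']
```

From here the learner can be handed a new paragraph and asked to adapt the patterns, for example to amounts with thousands separators.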

(3) Visualize executions, introducing debugging tools in the process

Just because regexes are declarative doesn't mean the matching process can't be described in imperative terms, especially initially.
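One way to do that is a toy matcher that narrates its own steps. The sketch below is a hypothetical teaching aid in Python, not how any real engine is implemented: it handles only literal characters, "." and a "*" after a single character, and requires the pattern to match the whole text.

```python
def match_here(pattern: str, text: str, trace: list[str], depth: int = 0) -> bool:
    """Return True if pattern matches all of text, logging each step in trace."""
    pad = "  " * depth
    if pattern == "":
        trace.append(f"{pad}pattern exhausted, text left: {text!r}")
        return text == ""
    if len(pattern) >= 2 and pattern[1] == "*":
        unit, rest = pattern[0], pattern[2:]
        # Count how many leading characters the starred unit could consume.
        n = 0
        while n < len(text) and unit in (".", text[n]):
            n += 1
        # Greedy: take as many as possible, then give characters back one by one.
        for taken in range(n, -1, -1):
            trace.append(f"{pad}{unit}* takes {taken} char(s): {text[:taken]!r}")
            if match_here(rest, text[taken:], trace, depth + 1):
                return True
        return False
    if text != "" and pattern[0] in (".", text[0]):
        trace.append(f"{pad}{pattern[0]!r} matches {text[0]!r}")
        return match_here(pattern[1:], text[1:], trace, depth)
    trace.append(f"{pad}{pattern[0]!r} fails against {text[:1]!r}")
    return False

trace: list[str] = []
ok = match_here("meta.*", "meta-x", trace)
print("\n".join(trace))
print("matched:", ok)
```

Reading the trace out loud with the learner, especially on a failing input, shows where the engine backs up and tries again.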

Later on, introduce tools like https://regex101.com/ or https://www.debuggex.com/ and always draw "railroad diagrams" that show what a given regex matches in terms of easily verbalized diagrams.
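Some languages also ship a small debugging aid of their own. As one example, Python's re module accepts the re.DEBUG flag, which prints the parsed structure of a pattern when it is compiled. The sketch below captures that printout in a string; the helper name describe is mine, and the exact format of the dump differs between Python versions.

```python
import contextlib
import io
import re

def describe(pattern: str) -> str:
    """Return whatever Python's re module prints when compiling under re.DEBUG."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

# Shows the literal characters of "meta" followed by a repeat of "any character".
print(describe(r"meta.*"))
```

It is far less friendly than a railroad diagram, but it needs no browser and makes the point that a pattern is parsed into a structure before anything is matched.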

(4) Disadvantages, subtleties, and other approaches

The learning process isn't a sales pitch: there are plenty of things that suck in regexes. They are non-standard and designed ad hoc, the runtime engine that runs them can be inefficient (unlikely if the host programming language is popular and more than 20 years old, but a thing to keep in mind nonetheless: regexes are a whole other language, requiring a separate interpreter or compiler from the one for the surrounding code), and the equivalent imperative code might not be so bad in comparison for simple cases, and much more debuggable.

The name "regex" is derived from a misnomer. The original "regular expressions" are a mathematical formalism to encode finite-state machines, and originally contained only alternation, sequencing and Kleene star (the '|' and the '*' operators, plus putting letters next to each other. That's it, those were the original regex capabilities). When programming languages and command-line utilities started to implement them in the 70s and 80s, each started to experiment with features that break this model. For example, "capture groups", the ability of the regex to copy parts of the matched string into variables, trivially break the model: if you can capture arbitrarily long strings, then you can't be a finite-state machine (strictly speaking, it is referring back to a captured group inside the pattern, a backreference, that takes the matched language beyond what a finite-state machine can recognize).
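A small Python sketch of a pattern that steps outside the finite-state model, using the standard re module. It relies on a backreference: \1 means "the same text that group 1 captured". The function name is mine.

```python
import re

# (a+) captures a run of a's; \1 demands the exact same run again after the b.
# Recognizing "n a's, then b, then n a's" for every n needs unbounded memory
# of n, which a finite-state machine does not have.
SAME_RUN = re.compile(r"(a+)b\1")

def same_run_both_sides(s: str) -> bool:
    return SAME_RUN.fullmatch(s) is not None

print(same_run_both_sides("aabaa"))   # True: two a's on each side
print(same_run_both_sides("aabaaa"))  # False: two on the left, three on the right
```

Without the \1, the pattern "(a+)ba+" would accept both inputs, and an ordinary finite-state machine could recognize it.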

This increases power but decreases efficiency guarantees (Perl's regexes are dangerously close to Turing-completeness [https://www.perlmonks.org/?node_id=809842]! The language is hiding a whole other language inside a single feature). It also complicates the notation with symbols for new capabilities that it wasn't designed for, with the result being the mess that regex syntax is now. It also means you can never "learn" regex, you can only learn (to whatever accuracy you care about) Perl's regex, or Java's regex, or Python's regex. There is a vague set of commonalities, but don't rely on remembering which features are common and which are different when there are so many features implemented in so many ways.

Don't let the learner come away thinking that "declarative" is synonymous with regexes. For example there is the parser combinator style, which can encode the above example as something like:

the_specific_string("meta")
    .followed_by(ANY_CHARACTER)
    .repeated(ZERO_OR_MORE_TIMES)
    .build_pattern()
    .recognize("meta-circular")

The key idea at play here is a sort of "builder pattern". There is an abstract "parser" object that has a single recognize(str) method, and you build your pattern by composing together the many customizable children that implement this abstract interface. The composition happens through "combinator methods", which take two or more parsers and build a parser that performs the mixture of their functionalities indicated by the name (e.g. followed_by() takes several parsers and sequences them next to each other, repeated() takes a list of parsers and iterates the last one any number of times, including skipping it entirely). The things being built to represent parsers are generally (in functional languages at least) closures, but there is no reason why this pattern can't be built on top of regexes: each step simply generates the equivalent meta character, and build_pattern returns the final pattern string.
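A minimal closure-based sketch of this idea in Python. Everything here is hypothetical and written for illustration: the function names only loosely follow the chain above, the combinators are plain functions rather than builder methods, and repeated() is greedy with no backtracking, so it is far weaker than a real parser-combinator library.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

def literal(expected: str) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(expected) if text.startswith(expected, pos) else None
    return parse

def any_character() -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + 1 if pos < len(text) else None
    return parse

def followed_by(*parsers: Parser) -> Parser:
    # Sequencing: run each parser where the previous one stopped.
    def parse(text: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for p in parsers:
            current = p(text, current)
            if current is None:
                return None
        return current
    return parse

def repeated(parser: Parser) -> Parser:
    # Zero or more times: keep going until the inner parser fails or stalls.
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            nxt = parser(text, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return parse

def recognize(parser: Parser, text: str) -> bool:
    # Accept only if the parser consumes the entire text.
    return parser(text, 0) == len(text)

meta_star = followed_by(literal("meta"), repeated(any_character()))
print(recognize(meta_star, "meta-circular"))  # True
print(recognize(meta_star, "met"))            # False
```

Because each parser is just a function, new building blocks can be added by the learner without touching any of the existing ones.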

There are tons of those "parser approaches": formalisms, tools, patterns and libraries to express strings, string recognition and parsing declaratively. Regexes are merely the most famous and widespread, which is a sad state of affairs IMO.