Hacker News
by Zhyl 1679 days ago
>Regular Expressions, abbreviated as RegEx or RegExp, are a string of characters created within the framework of RegEx syntax rules. You can easily manage your data with RegEx, which uses commands like finding, matching, and editing. Regex can be used in programming languages such as Phyton, SQL, Javascript, R, Google Analytics, Google Data Studio, and throughout the coding process. Learn regex online with examples and tutorials on RegexLearn now.

I always look at these intros/descriptions of Regex with a heavy heart. They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

The best motivation for regexes that I've read is actually from a Python Tutorial [0] where the author gives an example of writing a lot of nested 'if' statements that could all be solved by a single regex. On the whole, I think regexes are one of the most powerful tools that doesn't have enough publicity in large part due to this Catch 22 of trying to explain what they are.

[0] https://automatetheboringstuff.com/chapter7/
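
To make the "nested ifs versus one regex" point concrete, here is a minimal sketch in that spirit. It is not copied from the linked tutorial; the `ddd-ddd-dddd` phone-number format, the sample strings, and both function names are assumptions chosen for illustration.

```python
import re

def is_phone_number_manual(text: str) -> bool:
    """Check for ddd-ddd-dddd by hand, one condition at a time."""
    if len(text) != 12:
        return False
    for i, ch in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must be the dashes.
            if ch != "-":
                return False
        elif ch not in "0123456789":
            return False
    return True

# The same rule as one pattern: three digits, dash, three digits, dash, four digits.
PHONE_RE = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

def is_phone_number_regex(text: str) -> bool:
    """Check for ddd-ddd-dddd with one regex; fullmatch covers the whole string."""
    return PHONE_RE.fullmatch(text) is not None

# Both versions should agree on every sample.
for sample in ["415-555-4242", "415-555-424", "415.555.4242", "Moshi moshi"]:
    assert is_phone_number_manual(sample) == is_phone_number_regex(sample)
```

The manual version is not hard, but every extra rule (an optional area code, say) adds another branch, while the pattern grows by a few characters.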

5 comments

> They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

I frequently use the journalist's "5 Ws and H" framework as a checklist procedure for ensuring my technical communication covers fundamental questions/ideas:

* Who

* What

* Where

* Why

* When

* How

The slightly tricky thing is that you have to formulate a question for each W based on your domain. For example what is a fruitful "where" question for RegEx? Nonetheless the checklist makes me less likely to miss very key ideas, such as "why" one would use RegEx.

To make this idea more procedural maybe we could just formulate it as ungrammatical questions where you put the key topic after each W:

* Who RegEx?

* What RegEx?

* Where RegEx?

* Why RegEx?

* When RegEx?

* How RegEx?

And then just let your mind flesh them out into more complete questions...

That's a helpful list / framework. Thanks.

That said, where I see most tech sites / products fail is on addressing benefits. Why should I care? (As opposed to the brand or product why.)

I wish I had $20 for every "Looks cool. But it's not clear to me my life will be any better."

I think regex is absolutely terrible garbage. It’s very powerful, I do appreciate not having to write complex conditional statements, and it’s great that it’s available in so many languages and applications. But it’s just a bad tool, terribly unreadable, easy to introduce bugs, with lots of trial and error. It has too much brevity and would be much better if it were longer but more human readable. I’m sure there are other string search paradigms that are far better but relatively unknown.
You can get a feel for what that would look like here: https://metacpan.org/release/CHROMATIC/Regexp-English-1.01/v...

But then you're just memorizing things like 'start_of_line' instead of '^'. Perhaps easier to read, but no easier to write.

        -> start_of_line
        -> literal('Flippers')
        -> literal(':')
        -> optional
                -> whitespace_char
        -> end
        -> remember
                -> multiple
                        -> digit;
I literally can’t parse this as a whole. /^Flippers:\s?(\d+)/ is so much more obvious compared to that utter nonsense.
Like most code, it's easier to write regex than to read it later. In my recent vim history:

    /(\([^()]\|\n\)*\n\([^()]\|\n\)*)
This was from two days ago. I think I was searching a huge sheet of regex match groups for any having line breaks to join. In a month, I'm not even sure I would recognize that I had authored this.
So what. That was a problem you had to solve, imagine how helpless you’d feel if you had it with no regex available. Matching non-parenthesis or newline for two lines (prefix and suffix unrestricted) it is. Idk if it took half an hour or more to implement that in python, js or (god forbid) a low level language. You probably made it in less than a minute. And nobody would take their time to read a page of .substr(i, -(j-i)-1) two days later either.
not every solution has to be reusable
Your long-hand isn't quite the same as your regex...it should be remember -> one_or_more -> digit;

In regex parlance, \d+ explicitly allows for one or more digits. Multiple tacitly implies 2 or more which would be \d{2,}

Also, your end char (which I assume you mean $) would be after the remember -> one_or_more -> digit;

I didn’t refer to the manual (which is the entire goal of that format, isn’t it?) and don’t know what ‘multiple’ really means. So I stand both corrected and confirmed, I guess.

That ‘end’ thing just closes the ‘optional’ group, I believe. There is no $ in an English form of this regex either.
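
For anyone who wants to poke at the terse form from this sub-thread, here is a small Python sketch. The sample inputs are made up, and the variable names are only for illustration; the pattern itself is the one quoted above.

```python
import re

# The terse form quoted in the thread: optional whitespace, then a captured run of digits.
flippers = re.compile(r"^Flippers:\s?(\d+)")

match = flippers.match("Flippers: 7")
captured = match.group(1) if match else None  # the "remember" group

# \d+ accepts one or more digits; \d{2,} insists on at least two.
one_or_more = re.compile(r"\d+")
two_or_more = re.compile(r"\d{2,}")

single_digit_ok = one_or_more.fullmatch("7") is not None
single_digit_rejected = two_or_more.fullmatch("7") is None
```

Note that there is no `$` in the pattern, so trailing text after the digits does not prevent a match.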

Readability is very important though. If you can spend a couple more seconds of programming time to prevent several minutes (or longer!) of understanding time, I'd call that a good use of resources. I don't think that link is quite there yet but it's a good start.
It's more readable individually, but for many regexes the verbose nature could make it harder to read overall.
There's a good article about K that can give you a feel on how long names may not always be more readable: http://nsl.com/papers/denial.html
The readability isn't so bad if you let yourself allocate as much time and mental effort to understanding the one-line regex as you would use to understand the 100 line string-processing function that it replaces. And the brevity makes regexes handy on the command line and in single-line input fields in text editor search functions.

I do prefer using parser combinators for more complex tasks.

> I’m sure there are other string search paradigms that are far better but relatively unknown

Surely if there were, we’d have discovered them already. All of the regex criticism boils down to a few simple statements, one per category of case:

1) I didn’t learn regex and have no cheatsheet

Learn it or at least print a cheatsheet and stick it to the wall.

2) The problem that this specific regex solves is a hell of a regular problem under any representation.

Any particular regex is only as terrible as a ladder of corresponding if’s and for’s would be. Deal with it.

3) The problem that this specific regex solves is not a regular language.

Use a proper xml parser.
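
A rough illustration of point 3, assuming a tiny made-up document with nested `<item>` tags. The naive pattern below is deliberately simplistic; the parser side uses Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

doc = "<list><item>a<item>b</item></item><item>c</item></list>"

# A naive pattern pairs each opening tag with the NEXT closing tag,
# so it loses track of nesting.
naive = re.findall(r"<item>(.*?)</item>", doc)

# A real parser builds the tree, so nesting is preserved.
root = ET.fromstring(doc)
top_level = [child.text for child in root.findall("item")]  # direct children only
all_items = [el.text for el in root.iter("item")]           # every depth, document order
```

Here `naive` ends up containing the fragment `a<item>b`, which is exactly the kind of result that signals the problem is not a regular one.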

You seem to forget that regular expressions are pretty much simply required - and at least for their simpler cases, their syntax is reasonable - 'syn[a-z ]+?able' is far from unreadable and unwritable.

You have some text to process, open your text editor, you will probably use a dozen regular expressions for that - this is very frequent for many. Can you conceive a better syntax, at least for the simple cases?

Ignoring the flame bait, for me the only things I wish more regex engines supported (cough JavaScript) is the ability to ignore whitespace, and have named groups. Python has a flag to do this, and being able to have multi-line regexes with comments and named groups is phenomenal and greatly improves readability of more complex regexes.

In general I would say ~70% of regexes are highly readable. With tools like the above, you can probably go to like ~85%? There are some regexes that are super complicated and then likely should be refactored into a composition of simpler regexes. But that's just a guess. I wonder if there are any studies done about this...

> Ignoring the flame bait, for me the only things I wish more regex engines supported (cough JavaScript) is the ability to ignore whitespace, and have named groups.

Irregex? http://synthcode.com/scheme/irregex/

Interesting! I don't think that's what I mean. I don't think I want it to be a part of the language like that, but that's a pretty neat idea. For example, python has an 'X' flag you can use when creating a regex to allow new lines and comments. Here's an example from my code: https://github.com/internetarchive/openlibrary/blob/1ac15a48...
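
For readers who have not seen the flag being described, here is a minimal sketch of Python's `re.VERBOSE` (also spelled `re.X`) combined with named groups. It is not taken from the linked code; the date format and the names `DATE_RE`, `year`, `month`, and `day` are assumptions for illustration.

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored
# and '#' starts a comment that runs to the end of the line.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)

m = DATE_RE.fullmatch("2021-06-15")
parts = m.groupdict() if m else {}
```

The compact equivalent is `(\d{4})-(\d{2})-(\d{2})`; the verbose form costs more lines but lets each piece carry its own explanation.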
I’d argue that regex is elegant and incredibly useful to have in development… but it’s definitely easy to have ‘too much of a good thing’ here.
Regex is a great tool. Just use good taste and don't overdo it with regexes.

They're very effective at what they do as long as you don't make insane brainteasers that make people curse your name.

This is why I like tools such as RegExBuddy, which breaks down the regex into a graphic. It does real-time match highlights of test text and emulates most regex engines.
It’s just a small programming language.
> I always look at these intros/descriptions of Regex with a heavy heart. They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

> Catch 22 of trying to explain what they are.

any teachers, or people who explain/document things for a living, have some good tips or templates to avoid this?

I don't see what's so hard.

"With regex you can search for any combination of characters in a string or return any such combo or modification you like"

Yeah.

1) Encourage whoever you're teaching to stop you immediately if they don't feel like they understand something you're saying, even if it's a single word that's throwing them off, and especially if they're not rock-solid about a simple concept they "should already know". Modern school teaches people that "returning to the basics" is a waste of time; but as Feynman says, you should return to the basics often, as masters do. Pianists don't stop playing scales once they're famous. This means that if your student wants to review what an "expression" is, or what a "string" is, or what "returning" means, you've got to encourage them to do it. If a 10-minute explanation of RegEx turns into a 45-minute review of how the string variable type was invented, that will be more useful for the student in their pursuit of RegEx mastery than will a technically accurate but shallow regurgitation of your 10-minute spiel about what RegEx is. This is because they need to lay the mental framework of how they're going to think about RegEx; you are able to explain it in 10 minutes because you already have that built in your head, but they need to build those background pathways and connections themselves before analogies and summarizations make sense.

2) Try to figure out how you can make them experience the problem that led to the invention of RegEx. A student will never truly understand why a solution is valuable until they really, deeply understand the problem that the solution is solving. Note that I'm not saying that you need to teach the problem before the solution--not every student needs them in that order--just that they won't master the solution until they understand the problem.

3) In lieu of "testing" a student, have them take many breaks to re-explain what they've learned to you, even if you haven't reached a real conclusion about anything and are just checking that they understand a sentence you said. Many students, especially if they have a good teacher, will experience the sensation of comprehension even if it's not actually there. This is the "it makes sense when he says it, but when I try to explain it I can't find the words" phenomenon. Taking frequent breaks to have them explain things back to you in their own words will reveal their conceptual weaknesses, and those are what you focus on.

4) Don't try to get it all done in a single session. Learning requires both forgetting and sleep. First, you should tell them to expect to forget, and that they will need to come back over and over again to topics that seem basic or simple; forgetting is part of the process of learning, like painting multiple layers on a wall. Second, they need to sleep in between sessions, which means that you can't teach everything in one day and you can't learn everything in one day, and multiple days may need to be spent reviewing the same material.

This all makes a lot more sense when you treat learning like sports. Learning <programming topic> is like learning a slice serve in tennis. You don't need to serve slice, especially if you can hit flat serves at 115 mph, but serving slice is an invaluable technique when you're playing someone who can't return slice serves at all--that's a near-guaranteed 3/6 games out of every set. But in order to learn it, you need to focus on your tennis fundamentals (stay loose, eye on the ball, toss correctly), practice the same basic movements over and over again, get lots of sleep, and understand why you're learning the skill in the first place.

Good answer. I think 2) is the one that jumped out at me because it reflected my own experience and understanding - Regex became easier to understand when I also felt like I understood its motivations. Starting there, with motivation and context, is my typical go-to move.
Very valuable insight, thank you a lot!
Doesn't make sense to me.

> Regular expressions (commonly known as "regex") are used for advanced pattern matching in strings. They can also be used to replace text, transform strings, or extract substrings. It's a very powerful domain-specific language that is purpose-built for string patterns and manipulation. Many general-purpose programming languages include regex engines that use similar, but often slightly different syntaxes to support the use of regex.

>teachers, or people who explain/document things for a living

I'm neither of those, but I frequently explain things to my friends and they say I explain well. So I will throw my two cents anyway and hope you don't find them trivial self-help platitudes.

(1) Start with Concrete things

No learning ever starts from generalities. Never start with something like "Regular Expressions is a declarative language to describe strings of a certain general form blah blah blah". I call this the wikipedia style of teaching, an utterly useless word-swapping game where you explain things and constructs in terms of even more complicated (or equivalently complicated) things and constructs till the learner runs out of stack space and comes out learning nothing and feeling like a failure on top of that. Remember that learning is a process of building up: you start from familiar questions, problems, specifics, themes or worldviews of the learner, then gradually introduce generalizations and solutions to get them to where you want them to be.

(This is generally a two-way street, the learner also has to know something about the teacher and where they are coming from and what are they trying to do, it's like telling a story: The author can't simply say "because I say so!" to explain every detail of the plot, but the reader can't also say "I don't know, feels too unbelievable" in response to every plot detail.)

The bare essence of regex is using meta characters to encode several string characters. The fact that the regex

"meta.*"

so powerfully and succinctly encodes string-recognizing logic that would be imperatively expressed as

    def metastar(s):
        if len(s) < 4:
            return False
        if s[0:4] != "meta":
            return False
        return True

makes the case concretely and perfectly: a single string (two characters longer than the simplest string it matches) versus 3 bug-hiding branches (e.g. what if the "!=" operator in the implementation language actually compares string identity, not string equality?). This is even more generous than most languages allow: the ':' array slicing operator, for example, is saving us a loop (possibly inefficiently, if it's copying the slice from the string. Not a problem now for "meta", but who knows when it will be?).
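
A quick way to convince yourself the two forms agree is to run them side by side. This is a sketch under assumptions: the hand-written check is spelled in Python (with the slice written `0:4`, since Python slice ends are exclusive and "meta" has four characters), and the sample strings and the name `metastar_regex` are made up.

```python
import re

def metastar(s: str) -> bool:
    """Hand-written prefix check with the same three exits as the comment above."""
    if len(s) < 4:
        return False
    if s[0:4] != "meta":  # slice end is exclusive, so 0:4 covers four characters
        return False
    return True

META_RE = re.compile(r"meta.*")

def metastar_regex(s: str) -> bool:
    """re.match anchors at the start of the string, like the prefix check."""
    return META_RE.match(s) is not None

samples = ["meta", "metadata", "met", "", "a meta", "meta-circular"]
agreement = all(metastar(s) == metastar_regex(s) for s in samples)
```

An off-by-one in the slice bounds is exactly the sort of bug the hand-written branches can hide, and the pattern has no room for it.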

Regexes are patterns, which are things that resemble the things they are describing, but aren't any of those things specifically. It's like a dark silhouette of a man: it doesn't describe any specific man, it's a pattern that can match any man of the same general body plan and height. Regexes are silhouettes, and the dark parts are the meta characters that act as placeholders for arbitrary strings.

(2) Examples from real life

Don't just take the "menu approach" of reading all the features and meta characters and thinking you're explaining, actually take the time with examples. Again, examples are all that matters for the human brain, it's literally useless to tell somebody to imagine a golden mountain if they have never seen a mountain or gold before.

Our world is awash in strings of certain identifiable structures (money, dates, times, names in formal settings, equations, etc...). Take the time to obtain several real-life examples, and try to make the data come from sources like wikipedia or other publicly available datasets. After demonstrating how each of those 3 or 4 general forms of strings can be described powerfully by meta characters, give 3 or 4 more general forms to the learner to try on their own.
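
As a sketch of what such a worked example might look like, here are two of the "identifiable structures" mentioned above. The sentence, the US-style dollar format, and the 24-hour `HH:MM` format are all assumptions; real data would need patterns tuned to its own conventions.

```python
import re

# Dollar amounts: 1-3 digits, optional comma-separated thousands, optional cents.
MONEY_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

# 24-hour times: hours 00-23, minutes 00-59, on word boundaries.
TIME_RE = re.compile(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b")

text = "Doors open at 18:30; tickets are $1,250.00 before 09:00 and $45 after."
money = MONEY_RE.findall(text)
times = TIME_RE.findall(text)
```

A natural follow-up exercise for the learner is to break the patterns on purpose: what should happen with `25:00`, or with `$1,25`?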

(3) Visualize executions, introducing debugging tools in the process

Just because regexes are declarative, doesn't mean the matching process can't be described in imperative terms, especially initially.
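
One way to show the matching process imperatively is a tiny backtracking matcher. The sketch below is a Python rendering modeled on the well-known small matcher by Rob Pike that Brian Kernighan has written about; it handles only literal characters, `.`, `*`, `^`, and `$`, and is a teaching aid rather than how production engines work.

```python
def match(pattern: str, text: str) -> bool:
    """Search for pattern anywhere in text. Supports c . * ^ $ only."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty suffix.
    for start in range(len(text) + 1):
        if match_here(pattern, text[start:]):
            return True
    return False

def match_here(pattern: str, text: str) -> bool:
    """Match pattern at the beginning of text."""
    if not pattern:
        return True  # nothing left to match
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""  # '$' as the last pattern character means end of text
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(ch: str, pattern: str, text: str) -> bool:
    """Match zero or more occurrences of ch, then the rest of the pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and ch in (".", text[i]):
            i += 1  # consume one more ch and try again
        else:
            return False
```

Stepping through `match("a*b$", "caaab")` by hand, or with print statements added, shows the "try, fail, advance, retry" rhythm that the declarative notation hides.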

Later on, introduce tools like https://regex101.com/ or https://www.debuggex.com/ and always draw "railroad diagrams" that show what a given regex matches in terms of easily verbalized diagrams.

(4) Disadvantages, subtleties, and other approaches

The learning process isn't a sales pitch; there are plenty of things that suck in regexes. They are non-standard and designed ad hoc, the runtime engine that runs them can be inefficient (unlikely if the host programming language is popular and > 20 years old, but a thing to keep in mind nonetheless: regexes are a whole other language, requiring a separate interpreter or compiler other than the one for the surrounding code), and the equivalent imperative code might not be so bad in comparison for simple cases and much more debuggable.

The name "regex" is derived from a misnomer. The original "regular expressions" are a mathematical formalism to encode finite-state machines; it originally contained only alternation, sequencing and Kleene star (the '|' and the '*' operators, plus putting letters next to each other. That's it, those were the original regex capabilities). When programming languages and command-line utilities started to implement them in the 70s and 80s, each started to experiment with features that break this model. For example, "capture groups", the ability of the regex to copy parts of the matched string into variables, trivially break the model: if you can capture arbitrarily long strings, then you can't be a finite-state machine.
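
One commonly cited way to see the model break is a backreference to a captured group. The sketch below uses Python's `re`; the pattern and test strings are made up. The strings of the form "n a's, a b, then the same n a's" do not form a regular language, yet a backreference recognizes them.

```python
import re

# (a+) captures a run of a's; \1 demands the exact same run again after the b.
SAME_RUN = re.compile(r"(a+)b\1")

balanced = SAME_RUN.fullmatch("aaabaaa") is not None    # 3 a's on each side
unbalanced = SAME_RUN.fullmatch("aaabaa") is not None   # 3 a's, then only 2
```

A finite-state machine would need a separate state for every possible run length, which is why this feature cannot be implemented by the original formalism alone.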

This increases power but decreases efficiency guarantees (Perl's regexes are dangerously close to Turing-completeness [https://www.perlmonks.org/?node_id=809842]! The language is hiding a whole other language inside a single feature). It also complicates the notation with symbols for new capabilities that it wasn't designed for, with the result being the mess that regex syntax is now. It also means you can never "learn" regex, you can only learn (to whatever accuracy you care) Perl's regex, or Java's regex, or Python's regex. There is a vague set of commonalities, but don't rely on remembering which is common and which is different when there are so many features implemented in so many ways.

Don't let the learner come away thinking that "declarative" is synonymous with regexes. For example there is the parser combinator style, which can encode the above example as something like:

the_specific_string("meta"). followed_by(ANY_LETTER). repeated(ZERO_OR_MORE_TIMES). build_pattern(). recognize("meta-circular")

The key idea at play here is a sort of "builder pattern". There is an abstract "parser" object that has a single recognize(str) method, and you can build your pattern by composing together the many customizable children that implement this abstract interface. The composition happens by "combinator methods", which take two or more parsers and build a parser that performs a mixture of their functionalities indicated by the name (e.g. followed_by() takes several parsers and sequences them next to each other, repeated() takes a list of parsers and iterates the last one any number of times, including skipping it entirely). The things being built to represent parsers are generally (in functional languages at least) closures, but there is no reason why this pattern can't be built on top of regexes: each step simply generates the equivalent meta-character, and build_pattern returns the final pattern string.
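
Here is a minimal sketch of the "built on top of regexes" variant described above. Everything in it is hypothetical: the `Builder` and `Pattern` classes are invented for illustration, `ANY_CHAR` stands in for the comment's ANY_LETTER (read here as "any character", to mirror `.`), and `recognize` is assumed to mean "the text starts with the pattern".

```python
import re
from typing import Tuple

ANY_CHAR = "."            # assumption: stands in for the comment's ANY_LETTER
ZERO_OR_MORE_TIMES = "*"

class Pattern:
    """Compiled result; recognize() reports whether text starts with the pattern."""
    def __init__(self, source: str) -> None:
        self.source = source
        self._compiled = re.compile(source)

    def recognize(self, text: str) -> bool:
        return self._compiled.match(text) is not None

class Builder:
    """Accumulates regex fragments; each method returns a new Builder."""
    def __init__(self, parts: Tuple[str, ...] = ()) -> None:
        self._parts = parts

    def followed_by(self, fragment: str) -> "Builder":
        return Builder(self._parts + (fragment,))

    def repeated(self, quantifier: str) -> "Builder":
        # Quantify only the most recent fragment (assumes at least one exists).
        *head, last = self._parts
        return Builder(tuple(head) + (f"(?:{last}){quantifier}",))

    def build_pattern(self) -> Pattern:
        return Pattern("".join(self._parts))

def the_specific_string(literal: str) -> Builder:
    # re.escape keeps characters like '.' or '*' in the literal from acting as metacharacters.
    return Builder((re.escape(literal),))

pattern = (
    the_specific_string("meta")
    .followed_by(ANY_CHAR)
    .repeated(ZERO_OR_MORE_TIMES)
    .build_pattern()
)
```

With this sketch, `pattern.recognize("meta-circular")` goes through the same regex engine as `meta.*` would; the builder only changes how the pattern is spelled, not how it is matched.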

There are tons of those "parser approaches": formalisms, tools, patterns and libraries to express strings and string recognition and parsing declaratively. Regexes are merely the most famous and widespread, which is a sad state of affairs IMO.

Regex is powerful but I've found like 90% of the time I encounter one it would have been far simpler and more readable to use find + substring indexing or string splitting.

Imo replacing several nested if statements with a single esoteric regex is not necessarily a win. It depends on if pattern matching is really the best tool for the job.
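
A small side-by-side sketch of the trade-off being described, assuming a made-up `key=value` line format. Neither side is meant as the "right" answer; which one reads better depends on how much validation the job needs.

```python
import re

line = "timeout=30"

# Splitting: short and clear when the format is a single known separator.
key, _, value = line.partition("=")

# Regex: the same extraction, plus validation of both sides in one step.
KV_RE = re.compile(r"(?P<key>[a-z_]+)=(?P<value>\d+)")
m = KV_RE.fullmatch(line)
regex_key = m.group("key") if m else None
regex_value = m.group("value") if m else None

# On malformed input, partition still "succeeds" while the regex refuses it.
bad_key, _, bad_value = "timeout 30".partition("=")
bad_match = KV_RE.fullmatch("timeout 30")
```

For trusted input the split is arguably the more readable choice; once the input can be malformed, the checks that splitting needs start to add up.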

> find + substring indexing

An endless source of off-by-one errors, not to mention buffer overflows, index out of bounds exceptions, accidental negative indexing.

How are you both getting buffer overflows and bounds check failures in the same code?

Anyway, these are problems that arise if you don't test the code. In that scenario, regexps are an endless source of unexpected behavior as well, including in some implementations stack overflows and ReDoS attack-surfaces.

It very much depends what the use case is. I find that a lot of the text processing I do is easier to use back references or other regexy things.

Having said this, I use tools that make regexes easy to use and readily available - I think in many programming languages the syntax means that other solutions are just as easy to devise and implement.

If you are a solo dev, you do you, but if you are working in a team and you are building huge regexes with back references and other bells and whistles... I would guess it's not very readable for your teammates. At least for me, when I look at such a regex I have to stare at it for minutes before grokking it.
I wonder if there's a metric for code reviews measuring mean-time-to-grok (MTTG).
Regexps are fairly terse and replace a lot of code, compared to most languages they probably have an information density at somewhere between 10x-100x higher (i.e. it's not rare to replace 100 lines of code with 1 regex), so I think it's fair to expect it take longer to unpack their meaning.
Would that be a reasonable time to ping that coworker and ask them "what does this do?" Not because you can't figure it out, but because they already know?
Also interesting is the fact that parsers and languages are never mentioned. Regular expression engines are parsers for regular languages, one step below context-free languages.

I wonder why there are no context-free language parsers in standard libraries. The Earley parser can take grammars as input without necessarily having to generate code; it would be a great algorithm for a standard context-free parser.