by maratd 4139 days ago
Regular expressions are a natural fit for construction of regular expressions.

Look, I know it takes a while, but once you get the hang of it, you won't need any crutches to write regular expressions. The only tool that's really needed is a way to rigorously test a regular expression to make sure it does what it needs to do and there are a ton of those around.

6 comments

No, they're really not, as evidenced by all the quoting and meta-character nonsense you have to deal with. Sure, it's not too difficult to figure out, most of the time, but I think a solution that puts characters and logic on different quoting levels will almost always be better from an expressiveness standpoint (ignoring ecosystem issues).
This usually comes from having to express the regular expression inside a string literal, as many languages require. Perl, for example, has a regular expression literal syntax that is part of the language proper (which has the added benefit that non-dynamic regular expressions can be checked for syntax at compile time). Python, in contrast, doesn't have a first-class regular expression literal, but makes it easier to deal with by prefixing the literal with r or R to create a "raw string" (which exists to avoid excessive backslash escaping). Some regular expression engines use % as the meta-character indicator, which is more compatible with C-style "escape sequences" in double-quoted strings.
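To make the raw-string point concrete, here is a minimal Python sketch. The pattern and sample text are made up for illustration; only the standard `re` module is assumed.

```python
import re

# Without a raw string, each regex backslash must be doubled so that the
# string literal passes a single backslash through to the regex engine.
escaped = "\\bcat\\b"

# With the r prefix the backslashes reach the regex engine untouched.
raw = r"\bcat\b"

# Both literals produce exactly the same pattern text.
same_text = (escaped == raw)

# In a normal string, "\b" is a backspace character, not a word boundary,
# so this pattern looks for backspace + "cat" + backspace.
backspace = "\bcat\b"

text = "a cat sat"
found_raw = re.search(raw, text) is not None
found_backspace = re.search(backspace, text) is not None
```

Here the raw and doubled forms behave identically, while the unprefixed, undoubled form silently means something else.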

If you think characters and logic need to be on different quoting levels, you're not taking the right perspective on regular expressions. \d or \w are not an escaped d or w, they are their own atoms (or "the keywords of the language", if you will), distinct from the atoms that match the ASCII characters 0x64 and 0x77. The thing to remember with regular expressions is always the first lesson presented: (non-meta) characters match themselves, the regular expression /a/ matches the letter a. What's implied here, but rarely said, is that that's not really the letter a in there, but rather an expression that matches the letter a—it just so happens to also look like the thing it matches. This distinction is subtle, but important. This can also be made more evident by using the /x modifier if it's available to spread out the individual expressions (put space between the keywords).
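As a rough illustration of the "atoms" view, Python's `re.VERBOSE` flag plays the role of /x. The date pattern and sample string below are hypothetical examples, not taken from the discussion.

```python
import re

# Under re.VERBOSE, whitespace and comments in the pattern are ignored,
# so each atom can sit on its own line like a keyword.
date = re.compile(r"""
    \d{4}   # the digit atom, four times: the year
    -       # an expression that matches a literal hyphen
    \d{2}   # month
    -       # another literal hyphen
    \d{2}   # day
""", re.VERBOSE)

# The same pattern in its usual compact spelling.
compact = re.compile(r"\d{4}-\d{2}-\d{2}")

sample = "released 2013-05-17"
verbose_hit = date.search(sample).group(0)
compact_hit = compact.search(sample).group(0)

# \d is its own atom: it matches a digit and does not match the letter d.
digit_matches_d = re.fullmatch(r"\d", "d") is not None
```

Both spellings match the same text; the verbose one just puts space between the atoms.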

The primary difference in regular expression languages is often how "logic", as you call it, is expressed. PCRE considers, for example, [ to be the character for opening a character class and \[ to match the byte 0x5b. Admittedly, this is confusing when switching engines because 1) not every character matches itself (the expression that matches a character and the character it matches are not visually the same) and 2) other RE engines have taken the opposite approach, depending on whether the author meant that engine to have more literal atoms or more logic in its most common use (that is, you save typing if you mean to match the byte 0x5b more frequently than you mean to open a character class).

As for "quoting", you should almost NEVER be using things like PCRE's \Q…\E (or the quotemeta function) unless you're building regular expressions dynamically from user input. quotemeta and friends are not readability tools, but safety tools.
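A small sketch of the safety use case, using Python's `re.escape` as the closest analogue of quotemeta. The helper name `find_literal` and the sample text are hypothetical.

```python
import re

def find_literal(user_text, haystack):
    """Find literal occurrences of untrusted text inside haystack."""
    # re.escape backslash-escapes the metacharacters in user_text,
    # so the input is matched as plain characters.
    return re.findall(re.escape(user_text), haystack)

haystack = "cost: 1+1=2, or 111=2?"

# Interpolated as-is, the "+" acts as a quantifier: "one or more 1s, then 1".
unsafe = re.findall("1+1", haystack)

# Escaped, the same input only matches the three characters 1, +, 1.
safe = find_literal("1+1", haystack)
```

With this input the unescaped pattern finds `111` and misses `1+1`, which is the kind of surprise the escaping is there to prevent.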

I'm using the term "quoting" in the general sense of a marker that some sequence of symbols is being used as symbols, rather than for their semantic values.

My perspective on regular expressions is that of a student who was introduced to the formal version of REs not two weeks ago. In this formalism, there are basically strings and operators on these strings. We don't usually use quotes, but only because you can usually infer from context which bits are strings and which are one of the small set of operators. But when we need to match numbers with possible "+"es ("+" being the alternation operator) in front of them, out come the quotes.

In a typical programming language, we don't have the luxury of expecting the interpreter to infer things like that from context. Further, it's rather common to try to match things that would otherwise be used as metacharacters. This is exactly why quoting, in the general sense, was invented, so we can tell what's the program and what's the input.
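The signed-number case can be sketched in Python, where "+" is a repetition operator rather than alternation, but the same program-versus-input problem shows up. The sample string is made up.

```python
import re

# Outside a character class, "+" is an operator ("one or more"), so a
# literal plus sign has to be marked with a backslash.
signed = re.compile(r"\+?\d+")

# Left unmarked, the "+" is read as program, not input: it becomes a
# quantifier with nothing to repeat, and the pattern is rejected.
try:
    re.compile(r"+?\d+")
    unmarked_ok = True
except re.error:
    unmarked_ok = False

# An optional literal plus, then digits; a lone "+" does not match.
matches = signed.findall("+42 and 7 and +")
```

The backslash here is doing the job that quotes do in the formal notation: it says "this symbol is input".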

Granted, most of my RE experience is in Python, where everything is just jammed in a string. There it's obvious that metacharacters and escapes are just a worse-is-better substitute for quasiquoting. Maybe it's different in Perl, but I'm skeptical. Strings matching themselves is cool. The problem is that it's cool enough to prevent you from realizing when you've taken the metaphor too far.

I agree with you. Every now and then I see mentions of "all-new-regex-builder" on the HN front page. What is up with regex and the desire to write wrappers upon wrappers on top of it?

I see regex like this: if you have to use it often enough, it's better to learn it as it is, since that will be more helpful in the long run. If you don't use regex too often, then just google your question; there's a very high chance that somebody has already written a regex for your problem or a similar one.

The only tools I ever use are regex testers (like regexr.com) when I need to make sure that a pattern works correctly.

But alternative syntaxes are regular expressions too.
It's not a "crutch", it's an "alternative". Couching it in negative terms isn't really fair.

While I prefer writing regexes, a regex DSL isn't fundamentally better or worse, just different. In addition, it allows non-computer people to write, or at least specify, regexes in a way that makes more sense to non-developers.

Alternate representations of regexes aren't necessarily a crutch to avoid learning the normal syntax. S-expressions in particular could be useful for runtime manipulation or generation of patterns without the bother of string mangling. (I can't think of a reason to do so off-hand, but it's a nifty capability.)
Here's an example of this kind of thing from some emacs lisp I wrote (which I hope survived the transition to the HN comment box):

    (setq imenu-generic-expression
      (let ((ident '(1+ (any "A-Za-z0-9_"))))
        `(("plugin" ,(rx line-start
                         (0+ space) "plugin"
                         (1+ space) (eval ident)
                         (1+ space) (group (eval ident)))
           1))))
Of course, you can do this with string concatenation, but I think this syntax makes it clearer what's going on.
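For comparison, a rough Python analogue of the same idea, composing a pattern from named pieces instead of one long string. The helper names (`lit`, `seq`, `zero_or_more`, `one_or_more`, `group`) are hypothetical and loosely mirror the rx forms; `\s` is only an approximation of rx's `space`, and the sample input is invented.

```python
import re

# Hypothetical combinators; each one returns a regex fragment as a string.
def lit(text):
    return re.escape(text)          # literal text, metacharacters escaped

def seq(*parts):
    return "".join(parts)           # concatenation

def zero_or_more(part):
    return f"(?:{part})*"

def one_or_more(part):
    return f"(?:{part})+"

def group(part):
    return f"({part})"              # capturing group

SPACE = r"\s"
ident = one_or_more(r"[A-Za-z0-9_]")   # reused twice, like the let-bound ident

plugin = re.compile(
    seq("^", zero_or_more(SPACE), lit("plugin"),
        one_or_more(SPACE), ident,
        one_or_more(SPACE), group(ident)),
    re.MULTILINE,
)

m = plugin.search("  plugin loader my_plugin")
name = m.group(1)
```

The fragments are still strings underneath, so this is closer to disciplined concatenation than to a true s-expression representation, but it shows how a sub-pattern like `ident` can be defined once and reused.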
>Regular expressions are a natural fit for construction of regular expressions.

The particular syntax we use (which is not that great) is not THE "regular expressions"; it is just one syntax we arrived at.

That is, the "regular expressions" name doesn't refer to the syntax, but to the concept.