
Personally I use https://regex101.com to test and validate any nontrivial regex, and then I actually put a permalink to the "saved regex" in a comment in the code, so any future viewer (including myself) can review it. I also occasionally put patterns into their own standalone objects or functions (depending on the language), which allows you to test them right in your test suite.
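As a rough sketch of what I mean by a standalone, testable pattern (the names NUMERIC_ENTITY and parse_numeric_entity are made up for illustration, the pattern is the numeric-entity one discussed in this comment, and the permalink is a placeholder rather than a real saved regex):

```python
import re

# Saved regex101 link goes here so reviewers can poke at the pattern.
# (Placeholder only, not a real permalink: https://regex101.com/r/<your-id>)
NUMERIC_ENTITY = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")


def parse_numeric_entity(text):
    """Return the numeric part of an entity reference, or None if no match."""
    match = NUMERIC_ENTITY.match(text)
    return match.group(1) if match else None


def test_numeric_entity():
    # Plain asserts, so this runs under pytest or when called directly.
    assert parse_numeric_entity("&#65;") == "65"          # decimal
    assert parse_numeric_entity("&#017;") == "017"        # octal
    assert parse_numeric_entity("  &#x1F ; ") == "x1F"    # hex, extra spaces
    assert parse_numeric_entity("&amp;") is None          # not numeric
```

Keeping the compiled pattern at module level means the test suite exercises exactly the object the rest of the code uses, not a copy pasted into the test.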

I also make extensive use of the "verbose mode" in Python. Adapted from the example in https://docs.python.org/3/howto/regex.html, compare this:

    pattern = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")

and this attempt to clean it up with adjacent string literals:

    pattern = re.compile(
        r"^\s*"
        r"&#("
        r"0[0-7]+"
        r"|[0-9]+"
        r"|x[0-9a-fA-F]+"
        r")\s*;\s*$"
    )

to this:

    pattern = re.compile(r"""
      ^\s*
      &[#]                 # Start of a numeric entity reference
        (
            0[0-7]+        # Octal form
          | [0-9]+         # Decimal form
          | x[0-9a-fA-F]+  # Hexadecimal form
        )
      \s*;                 # Trailing semicolon
      \s*$
    """,
    re.VERBOSE)

It's still not ideal, but for me it's a good balance between terseness (greater information density) and readability.
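If you rewrite an existing pattern into verbose mode, it's cheap to check that you didn't change its behavior along the way. A minimal sketch (COMPACT, VERBOSE, SAMPLES and same_result are names I made up for this example, and the sample strings are arbitrary):

```python
import re

COMPACT = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")

# In verbose mode, whitespace in the pattern is ignored and an unescaped "#"
# starts a comment, which is why the literal "#" is written as "[#]" here.
VERBOSE = re.compile(r"""
  ^\s*
  &[#]                 # Start of a numeric entity reference
    (
        0[0-7]+        # Octal form
      | [0-9]+         # Decimal form
      | x[0-9a-fA-F]+  # Hexadecimal form
    )
  \s*;                 # Trailing semicolon
  \s*$
""", re.VERBOSE)

# A mix of inputs that should and should not match.
SAMPLES = ["&#65;", " &#017 ; ", "&#x1f;", "&#;", "&#xZZ;", "& #65;", "&#65"]


def same_result(text):
    """True if both patterns agree on match/no-match and on the captured group."""
    a = COMPACT.match(text)
    b = VERBOSE.match(text)
    return (a and a.group(1)) == (b and b.group(1))


mismatches = [s for s in SAMPLES if not same_result(s)]
print("mismatches:", mismatches)
```

This only proves agreement on the samples you list, so it's a sanity check rather than a proof of equivalence.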

The equivalent in Pomsky (I think this is the one that was formerly Rulex? https://pomsky-lang.org/) would be very similar:

    Start [s]*
    '&#'    # Start of a numeric entity reference
    (
      # Octal form
        '0' ['0' - '7']+
      # Decimal form
      | ['0' - '9']+
      # Hexadecimal form
      | 'x' ['0' - '9' 'a' - 'f' 'A' - 'F']+
    )
    [s]* ';'    # Trailing semicolon
    [s]* End

and arguably more verbose, due to the mandatory quotation marks. Note that Pomsky actually inherits the ambiguity of "Start" and "End" that led to this security bug in the first place!
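To make the anchor ambiguity concrete, here is how it plays out in Python's re module. I'm assuming the ambiguity in question is the usual "end of string vs. end of line" one. This sketch says nothing about how Pomsky itself compiles Start and End, and LOOSE, STRICT and MULTI are just names for this example:

```python
import re

LOOSE = re.compile(r"^[0-9]+$")                  # "$" is looser than it looks
STRICT = re.compile(r"\A[0-9]+\Z")               # true start/end of string
MULTI = re.compile(r"^[0-9]+$", re.MULTILINE)    # anchors apply per line

# "$" also matches just before a trailing newline, so this input is accepted.
trailing_newline_loose = LOOSE.match("123\n") is not None

# "\Z" only matches at the very end of the string, so this is rejected.
trailing_newline_strict = STRICT.match("123\n") is not None

# With re.MULTILINE, one clean line is enough for a search to succeed.
embedded_multi = MULTI.search("evil\n123\nevil") is not None

# Without re.MULTILINE, "^" only matches at the start of the string.
embedded_loose = LOOSE.search("evil\n123\nevil") is not None

print(trailing_newline_loose, trailing_newline_strict,
      embedded_multi, embedded_loose)
```

For validation, `\A...\Z` or `re.fullmatch` is the safer habit in Python, whatever syntax you use to write the pattern.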

Pomsky gets you a few other advantages, e.g. compatibility and polyfills across different regex engines, but I think the similar syntax goes to show how dramatic an improvement "verbose regex" mode can be.

Finally, you have "English-like" DSLs more akin to my original suggestion, as in ReadableRegex.jl (https://github.com/jkrumbiegel/ReadableRegex.jl). I'm not sure how you'd construct the above pattern in that DSL, but I am sure that you would trade away information density and a sense of overall structure, and gain increased clarity of each individual operation. Set your priorities accordingly.