by IshKebab 1435 days ago
I like regex too, but only in interactive contexts where you can verify the results (editors, search engines, etc.). It's quite like Bash in that regard: good when you want to get a lot done without a lot of typing and you don't care if it only works on the input in front of you, and a terrible idea everywhere else.

I also agree that more verbose syntax would help a lot. I've seen quite a few attempts to do that recently (e.g. the project formerly known as Rulex).


Personally I use https://regex101.com to test and validate any nontrivial regex, and then I actually put a permalink to the "saved regex" in a comment in the code, so any future viewer (including myself) can review it. I also occasionally put patterns into their own standalone objects or functions (depending on the language), which allows you to test them right in your test suite.
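
As a rough sketch of the "standalone function" idea: the names `NUMERIC_ENTITY` and `is_numeric_entity` are made up for illustration, and the permalink comment is a placeholder rather than a real saved regex.

```python
import re

# Saved regex: <regex101 permalink goes here> (placeholder, not a real link)
NUMERIC_ENTITY = re.compile(r"&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")


def is_numeric_entity(text: str) -> bool:
    """Return True if the whole string is one numeric entity reference."""
    # fullmatch requires the pattern to cover the entire input string.
    return NUMERIC_ENTITY.fullmatch(text) is not None


def test_is_numeric_entity() -> None:
    """A test that a runner such as pytest could pick up by name."""
    assert is_numeric_entity("&#65;")
    assert is_numeric_entity("&#x41;")
    assert not is_numeric_entity("&#;")
    assert not is_numeric_entity("&#65;\n")
```

Because the pattern sits behind a named function, the test suite exercises exactly the object the rest of the code uses.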

I also make extensive use of the "verbose mode" in Python. Adapted from the example in https://docs.python.org/3/howto/regex.html, compare this:

    pattern = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")
and this attempt to clean it up:

    pattern = re.compile(
        # Raw strings throughout, so "\s" reaches the regex engine unchanged
        r"^\s*"
        r"&#("
        r"0[0-7]+"
        r"|[0-9]+"
        r"|x[0-9a-fA-F]+"
        r")\s*;\s*$"
    )
to this:

    pattern = re.compile(r"""
      ^\s*
      &[#]                 # Start of a numeric entity reference
        (
            0[0-7]+        # Octal form
          | [0-9]+         # Decimal form
          | x[0-9a-fA-F]+  # Hexadecimal form
        )
      \s*;                 # Trailing semicolon
      \s*$
    """,
    re.VERBOSE)
It's still not ideal, but for me it's a good balance between terseness (greater information density) and readability.
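
If you want some reassurance that the verbose rewrite did not change behavior, a quick spot check along these lines is one option. The sample strings are arbitrary picks for this sketch, not an exhaustive test:

```python
import re

compact = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")

# In verbose mode, whitespace in the pattern is ignored and "#" starts a
# comment, which is why the literal "#" is written as the class [#].
verbose = re.compile(r"""
  ^\s*
  &[#]                 # Start of a numeric entity reference
    (
        0[0-7]+        # Octal form
      | [0-9]+         # Decimal form
      | x[0-9a-fA-F]+  # Hexadecimal form
    )
  \s*;                 # Trailing semicolon
  \s*$
""", re.VERBOSE)

samples = ["&#65;", "  &#x41 ; ", "&#017;", "&#;", "&65;", "&#xZZ;", "&# 65;"]
for s in samples:
    # Both patterns should agree on every sample, match or no match.
    assert bool(compact.match(s)) == bool(verbose.match(s)), s
```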

The equivalent in Pomsky (I think this is the one that was formerly Rulex? https://pomsky-lang.org/) would be very similar:

    Start [s]*
    '&#'    # Start of a numeric entity reference
    (
      # Octal form
        '0' ['0' - '7']+
      # Decimal form
      | ['0' - '9']+
      # Hexadecimal form
      | 'x' ['0' - '9' 'a' - 'f' 'A' - 'F']+
    )
    [s]* ';'    # Trailing semicolon
    [s]* End
and arguably more verbose, due to the mandatory quotation marks. Note that Pomsky actually inherits the ambiguity of "Start" and "End" that led to this security bug in the first place!
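
The bug itself isn't spelled out here, so take this only as an illustration of how end anchors can be ambiguous, using Python's `re` module. It may or may not be the exact issue in question:

```python
import re

dollar = re.compile(r"^[0-9]+$")
strict = re.compile(r"\A[0-9]+\Z")

# In Python, "$" matches at the end of the string and also just before
# a single trailing newline, so this input is accepted.
assert dollar.match("123\n") is not None

# "\Z" only matches at the very end of the string.
assert strict.match("123\n") is None

# re.fullmatch is another way to require that the whole string matches.
assert re.fullmatch(r"[0-9]+", "123\n") is None
```

Other regex engines define `$` differently, which is part of why a generic "End" keyword can hide a surprise.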

Pomsky gets you a few other advantages, e.g. compatibility and polyfills across different regex engines, but I think the similar syntax goes to show how dramatic an improvement "verbose regex" mode can be.

Finally, you have "English-like" DSLs more akin to my original suggestion, as in ReadableRegex.jl (https://github.com/jkrumbiegel/ReadableRegex.jl). I'm not sure how you'd construct the above pattern in that DSL, but I am sure that you would trade away information density and a sense of overall structure, and gain increased clarity of each individual operation. Set your priorities accordingly.

Yeah the Pomsky one is already way better because you can easily see that &# are literal characters, not some weird regex thing you've forgotten about.

That's one of the biggest issues with regex - mixing up data and control.

But I would still expect a robust codebase to have a proper number parser if you want to parse this sort of thing.
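
For what it's worth, here is one reading of "proper number parser" as a minimal Python sketch. The function name `parse_numeric_entity` is invented for this example, and the octal/decimal/hex rules are simply copied from the regex upthread, not from any spec:

```python
def parse_numeric_entity(text: str) -> int:
    """Parse a string like '&#65;' or '&#x41;' and return its numeric value."""
    body = text.strip()
    if not (body.startswith("&#") and body.endswith(";")):
        raise ValueError(f"not a numeric entity reference: {text!r}")
    # Whitespace is tolerated before the semicolon, mirroring the regex.
    digits = body[2:-1].rstrip()

    if digits[:1] == "x":
        base, payload, allowed = 16, digits[1:], "0123456789abcdefABCDEF"
    elif len(digits) > 1 and digits[0] == "0" and all(c in "01234567" for c in digits[1:]):
        base, payload, allowed = 8, digits[1:], "01234567"
    else:
        base, payload, allowed = 10, digits, "0123456789"

    # Check characters explicitly: int() alone would also accept things
    # like underscores, signs, and surrounding whitespace.
    if not payload or any(c not in allowed for c in payload):
        raise ValueError(f"bad digits in numeric entity reference: {text!r}")
    return int(payload, base)
```

Unlike the regex, this returns the value directly and gives a specific error for malformed input.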

What is regex but shorthand notation for a parser?

I agree that a good codebase should generally have its regex segregated into standalone functions with their own tests (ideally property-based tests!).
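
A hand-rolled sketch of what a property-style test might look like for the pattern upthread. A library such as Hypothesis would normally do the generating and shrinking; this uses only the standard library to stay self-contained, and the helpers `decode`, `random_entity`, and `check_round_trip` are invented for the example:

```python
import random
import re

PATTERN = re.compile(r"^\s*&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+)\s*;\s*$")


def decode(group: str) -> int:
    """Decode the captured digits using the same rules as the pattern."""
    if group.startswith("x"):
        return int(group[1:], 16)
    if len(group) > 1 and group.startswith("0") and set(group) <= set("01234567"):
        return int(group[1:], 8)
    return int(group, 10)


def random_entity(rng: random.Random) -> tuple[str, int]:
    """Build a random valid entity string plus the value it should decode to."""
    value = rng.randrange(0x110000)
    form = rng.choice(["dec", "hex", "oct"])
    if form == "dec":
        digits = str(value)
    elif form == "hex":
        digits = "x" + format(value, rng.choice(["x", "X"]))
    else:
        digits = "0" + format(value, "o")
    pads = [" " * rng.randrange(3) for _ in range(3)]
    return f"{pads[0]}&#{digits}{pads[1]};{pads[2]}", value


def check_round_trip(trials: int = 1000, seed: int = 0) -> None:
    """Property: every generated entity matches and decodes to its value."""
    rng = random.Random(seed)
    for _ in range(trials):
        text, expected = random_entity(rng)
        match = PATTERN.match(text)
        assert match is not None, text
        assert decode(match.group(1)) == expected, text
        # Dropping the semicolon should always break the match.
        assert PATTERN.match(text.replace(";", "")) is None, text
```

Calling `check_round_trip()` from a test is enough to run it; the fixed seed keeps failures reproducible.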