by throw681158 2184 days ago
Won't work if you're already in a string, or if there are escaped quotes in the string. Also won't work if you have two or more double quoted strings that both contain an apostrophe.

Escaping isn't an intrinsic property of all quoted strings (e.g. single-quoted strings in bash have no escapes), but even so one can handle escaped quotes without backreferences, by matching any character that's not a quote or a backslash, _or_ any escaped character:

    /"([^"\\]|\\.)*"/
Now double that up with a single quote version if you wish.
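A minimal Python sketch of that pattern, doubled up for single quotes. The names `STRING_RE` and `find_strings` are made up for illustration, and the groups are written as non-capturing so that each match is the whole string literal.

```python
import re

# A quoted string is: the quote, then any run of characters that are neither
# the quote nor a backslash, or a backslash followed by any one character,
# then the same kind of quote again.
DOUBLE = r'"(?:[^"\\]|\\.)*"'
SINGLE = r"'(?:[^'\\]|\\.)*'"
STRING_RE = re.compile(DOUBLE + "|" + SINGLE)

def find_strings(text):
    """Return every quoted string literal found in text, quotes included."""
    return [m.group(0) for m in STRING_RE.finditer(text)]

sample = r'''say "a \"quoted\" word" and 'it\'s' done'''
print(find_strings(sample))
```

No backreference is needed because each alternative hard-codes its own closing quote.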

What you can't match without backreferences, however, is strings with customisable terminators, e.g. the behaviour in sed that whatever character you use after `s` is the regex terminator (it doesn't have to be `/`), or raw strings in C++.
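As a rough illustration of the sed case, here is a Python sketch that captures whatever character follows `s` and requires that same character again via `\1`. It is a simplification, not sed's actual parser: it ignores trailing flags and sed's restrictions on which characters may serve as the delimiter. `SED_SUBST` and `split_subst` are hypothetical names.

```python
import re

# Each part accepts an escaped character, or any character that is neither
# a backslash nor the captured delimiter (checked with a lookahead on \1).
PART = r'((?:\\.|(?!\1)[^\\])*)'

# s <delim> pattern <delim> replacement <delim>
SED_SUBST = re.compile(r's(.)' + PART + r'\1' + PART + r'\1')

def split_subst(command):
    """Return (delimiter, pattern, replacement), or None if nothing matches."""
    m = SED_SUBST.fullmatch(command)
    return m.groups() if m else None

print(split_subst('s/foo/bar/'))
print(split_subst('s|/usr/bin|/opt|'))
print(split_subst(r's/a\/b/c/'))
```

Without the backreference you would need one alternative per possible delimiter character.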

Backreferences don't really help with those problems.

> Won't work if you're already in a string

This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.

> or if there are escaped quotes in the string.

Solvable:

    '(\\'|\\\\|[^'\\])*'|"(\\"|\\\\|[^"\\])*"
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.

The regex in my previous comment already solves that. See: https://repl.it/repls/SolidCapitalProgram
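A small Python sketch of the "two double-quoted strings that both contain an apostrophe" case. The pattern here is an assumed variant of the alternation idea, not necessarily the exact regex behind the link. Since the regex engine takes the leftmost match, the double-quoted alternative consumes each apostrophe before the single-quoted alternative ever gets to start at it.

```python
import re

# Alternation of a double-quoted and a single-quoted string pattern.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"' + "|" + r"'(?:\\.|[^'\\])*'")

text = '''print("it's", "that's")'''

# Scanning left to right, each double-quoted string is matched as a whole.
good = [m.group(0) for m in STRING_RE.finditer(text)]

# A naive single-quote-only pattern pairs up the two apostrophes instead.
naive = re.findall(r"'[^']*'", text)

print(good)
print(naive)
```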

Raku allows strings inside of strings. Of course it does this by way of embedded closures.

    "abc{ "def" }"
Which allows it to be arbitrarily deep.

    "a{ "b{ "c{ "d{ "e{ "f" }g" }h" }i" }j" }k"
    → "abcdefghijk"
This can be handy to generate the correct string.

    my $count = 3;
    "I went to $count place{ "s" if $count ≠ 1 } today"
Interesting, thanks for pointing out a use case. But I don't think backreferences will help with that; it needs to be parsed by something more powerful than a regex.
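To make "something more powerful than a regex" concrete, here is a minimal recursive-descent sketch in Python. It assumes a heavily simplified version of the nested strings above: only double quotes, and a brace block may contain nothing but whitespace and one nested string (no arbitrary code such as `"s" if $count ≠ 1`). The function names are made up for illustration.

```python
def skip_spaces(src, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(src) and src[i].isspace():
        i += 1
    return i

def parse_string(src, i=0):
    """Parse a double-quoted string starting at src[i].

    Returns (value, next_index), where value is the text with every nested
    block replaced by the value of the string it contains.
    """
    if i >= len(src) or src[i] != '"':
        raise ValueError(f"expected opening quote at index {i}")
    i += 1
    parts = []
    while i < len(src):
        ch = src[i]
        if ch == '"':                     # closing quote of this level
            return "".join(parts), i + 1
        if ch == "{":                     # embedded block: recurse
            i = skip_spaces(src, i + 1)
            if i < len(src) and src[i] == '"':
                value, i = parse_string(src, i)
                parts.append(value)
                i = skip_spaces(src, i)
            if i >= len(src) or src[i] != "}":
                raise ValueError(f"expected closing brace at index {i}")
            i += 1
        else:                             # ordinary character
            parts.append(ch)
            i += 1
    raise ValueError("unterminated string")

nested = '"a{ "b{ "c{ "d{ "e{ "f" }g" }h" }i" }j" }k"'
print(parse_string(nested)[0])
```

The recursion depth tracks the nesting depth, which is exactly the state a plain regex has no place to keep.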

And that example reminds me that Bash can do something similar:

    echo "$(echo "$(echo "$(echo "hi")")")"
The Rakudo implementation actually uses Raku regexes to parse Raku. To be fair, though, it is a lot easier to do that with the redesigned regexes that Raku has.

Basically you can use backreferences for that if you also allow the regex to be recursive.

    my $regex = /
      :ratchet
      $<q> = (<["']>) # the beginning quote

      {}:my $q = ~$<q>; # put it into a more normal lexical var

        # capture between " and {
        $<l> = ( [ <!before $q> <-[{}]> ]* )

        [
          [
            :sigspace
            「{」
                <self=&?BLOCK>? # recurse
            「}」
          ]

          {$q = ~$<q>}

          # capture between } and "
          $<r> = ( [ <!before $q> <-[{}]> ]* )
        ]?

      "$q" # match the end quote

      # pass the combined string parts upwards
      { make ($<l> // '') ~ ($<self>.ast // '') ~ ($<r> // '') }
    /;

    「'a{ "b{ "c{ "d{ 'e{ "f" }g' }h" }i" }j" }k'」 ~~ /^ <r=$regex> $ { make $<r>.ast }/;

    say $/.ast;
    # abcdefghijk
Note that `Regex` is a subtype of `Block`. That is why `&?BLOCK` can be used as a reference to the regex itself.

`<foo=bar>` is a way to call `bar`, but also save it under the name of `foo`. `$<foo> = …` is a way to capture `…` and save it under the name of `foo`.

---

It is a lot nicer and more modular when you use regexes as part of a grammar:

    # use Grammar::Tracer;
    grammar String::Grammar {
      token TOP { <strings> }

      rule strings {
        # at least one string
        # if there is more than one, they are separated by ~
        <string> + % 「~」
      }

      token string {
        $<q> = <["']>

        # set a dynamic variable to the quote character
        {}:my $*quote = ~$<q>;

        <string-part>*

        "$<q>"
      }

      # multiple tokens that act like one
      # which is nicer than using |
      proto token string-part {*}
      multi token string-part:<non> {
        [ <-[{}]> <!after $*quote> ]+
      }
      multi token string-part:<block> {
        <block>
      }

      rule block {
        「{」 ~ 「}」 <strings>?
      }
    }

    class String::Actions {
      method TOP     ($/) { make     $<strings>.ast }
      method strings ($/) { make [~] @<string>».ast }
      method string  ($/) { make [~] @<string-part>».ast }
      method block   ($/) { make     $<strings>.ast }

      method string-part:<non>   ($/) { make ~$/ }
      method string-part:<block> ($/) { make $<block>.ast }
    }

    say String::Grammar.parse(
        「"a{ "b{ "c{ "d{ "e{ "f" }g" ~ "zz" }h" }i" }j" }k"」,
        :actions( String::Actions ),
    ).ast;
    # abcdefgzzhijk
A `token` is just a `regex` with `:ratchet` mode turned on (which prevents backtracking). A `rule` is just a `token` with `:sigspace` also turned on (which makes it easier to deal with optional whitespace).

Every instance of `<foo>` is basically a method call.

`make` is about generating an `.ast` to pass up and out of the parse. In this case the only thing the actions class does is return what would be the resulting string if it were compiled in Raku.

Re: already in a string: one of the primary uses of regex is to search from point in a text editor. So the cursor is in a string and you want to find the next string. Regex won't work on its own; you generally need more semantic information to differentiate opening and closing quotes (unless you can use local context from that particular language to infer it).

But more broadly, any situation where you search from a non-zero index has this problem.
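A short Python sketch of that failure mode, assuming a simple double-quoted-string pattern (`STRING_RE` is an illustrative name). Starting the search inside the first string makes the regex treat that string's closing quote as an opening one.

```python
import re

STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')

text = 'x = "foo" + y + "bar"'

# Searching from the beginning pairs the quotes correctly.
from_start = STRING_RE.search(text).group(0)

# Put the "cursor" inside the first string, on the first 'o' of foo.
cursor = text.index("oo")
from_cursor = STRING_RE.search(text, cursor).group(0)

print(from_start)   # the real first string
print(from_cursor)  # the text between two strings, mistaken for a string
```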

I'm surprised your example works in Python. Is that a property of Python's parser, or all regex matchers?

> Regex won't work on its own, you generally need more semantic information

Yeah, I agree. My point was that regex won't work regardless of whether you have backreferences. So backreferences won't help.

> But more broadly, any situation where you search from a non-zero index has this problem.

I'm not sure I understand that. A lot of regex libraries let you specify a start index. It won't take into account data from before the start index though (regex doesn't really do that, regardless of backreferences). If your regex library doesn't support passing in a start index, you can just take a substring starting at that index, then search the substring.

I don't think Python is really special. Python's findall() is just a convenience function that does a loop finding a match, then finding another match that starts after the first match, etc. Most languages provide a way to find the end point of the most recent match, and then you can just write the loop yourself to start the next search at that point.
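A sketch of that loop in Python, using `Pattern.search(text, pos)` for the start index and `Match.end()` for the resume point. `find_all` is a made-up helper that only approximates `findall()` (it ignores capture groups and returns whole matches).

```python
import re

def find_all(pattern, text, start=0):
    """Collect successive non-overlapping matches, searching from `start`."""
    matches = []
    pos = start
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        matches.append(m.group(0))
        # Resume where this match ended; step one extra character after an
        # empty match so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return matches

STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')
text = 'a = "one"; b = "two"; c = "three"'

print(find_all(STRING_RE, text))
print(find_all(STRING_RE, text, start=10))

# Without a start-index parameter, slice and add the offset back.
start = 10
sliced = STRING_RE.search(text[start:])
print(sliced.start() + start == STRING_RE.search(text, start).start())
```

One caveat on the slicing route: anchors such as `^` and lookbehinds see the slice boundary as the start of the text, so the two approaches can differ for patterns that use them.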

> This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.

    query = "select * from table where name like \"%foo\""
Interesting. Although in that situation I think it would be easier to find the outer string, then unescape it, then find the inner string.
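The "find the outer string, unescape it, then find the inner string" approach might look like this in Python. It assumes backslash is the only escape mechanism and that unescaping just means dropping the backslash; `STRING_RE` and `unescape` are illustrative names.

```python
import re

# Double-quoted string; group 1 is the body between the quotes.
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

def unescape(body):
    """Drop the backslash from every backslash-escaped character."""
    return re.sub(r'\\(.)', r'\1', body)

source = r'query = "select * from table where name like \"%foo\""'

# Step 1: find the outer string and unescape its body.
outer_value = unescape(STRING_RE.search(source).group(1))

# Step 2: the inner string is now an ordinary string inside that text.
inner_value = unescape(STRING_RE.search(outer_value).group(1))

print(outer_value)
print(inner_value)
```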

If you want to do it with pure regex, it can be done without backreferences too, but the pattern would grow exponentially as you add more levels of nesting, whereas with backreferences I think it would only grow quadratically.

As some other replies pointed out, there are straightforward modifications to handle all those scenarios if those are your requirements instead. A place where regex _does_ fail is in arbitrarily nested string interpolations (the key being _arbitrary_ nesting because with enough time anyone can come up with a convoluted enough regex to handle a bounded degree of recursion).
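To illustrate the bounded case, here is a Python sketch that builds a regex for a fixed maximum nesting depth by wrapping the pattern for the previous depth. It assumes a simplified interpolation syntax (double quotes only, a brace block holding at most one nested string), and `nested_string_pattern` is a hypothetical helper.

```python
import re

def nested_string_pattern(depth):
    """Regex source for "..." strings with at most `depth` levels of { "..." }."""
    pattern = r'"[^"{}]*"'   # depth 0: no interpolation allowed
    for _ in range(depth):
        # One more level: plain characters, or a brace block that optionally
        # holds a string matching the previous (shallower) pattern.
        pattern = r'"(?:[^"{}]|\{\s*(?:' + pattern + r')?\s*\})*"'
    return pattern

two_levels = '"a{ "b{ "c" }" }"'

for depth in (1, 2, 3):
    matched = re.fullmatch(nested_string_pattern(depth), two_levels) is not None
    print(depth, matched)
```

Any fixed depth can be handled this way, but an input nested one level deeper than the pattern was built for is rejected, which is the "arbitrary nesting" limit.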
That wasn’t specified in the requirements.
Good point.