by olsgaarddk 1357 days ago
I thought I was pretty good at regex, but I could never have written this one and had to consult both `man grep` and regex101.com.

Explanations for the beginner and intermediate regex and grep user:

`-o`: Only return the match, instead of the entire line

`-P`: use Perl compatible regex

`-m NUM`: max count; stop reading a file after NUM matching lines

And now for the regex:

`name:`: find the exact match

`\s*"`: Zero or more whitespace characters leading up to and including a double quote

`\K`: This was the kicker for me. "resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match" - basically tells the regex engine that the characters _before_ `\K` need to be there in order to form a match, but it should only return the characters _after_ `\K` as the match. This is super handy! Is there a "reverse \K"?

`[^"]+`: One or more characters that are not a double quote. This basically means "Find the line that has a key called "name" and return all the characters after the first double quote and until the last double quote"
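
Putting the pieces together, here is a small sketch. The original command isn't quoted in this comment, so the exact invocation and the sample input (a made-up file with two `name:` lines) are assumptions on my part:

```shell
# Hypothetical sample input; the real file from the original post is not shown here.
sample='version: "1.0"
name: "first-pkg"
name: "second-pkg"'

# -o prints only the match, -P enables PCRE (needed for \K),
# -m1 stops reading after the first matching line.
first_name=$(printf '%s\n' "$sample" | grep -oP -m1 'name:\s*"\K[^"]+')
echo "$first_name"   # prints: first-pkg
```

Without `-m1` the second `name:` line would be reported as well.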

3 comments

If you'd like to learn more about such grep powers, check out my free ebook [0]

What do you mean by "reverse \K"? Are you aware of lookarounds? Perhaps you meant positive lookahead?

    # match digits only if there is a semicolon afterwards
    $ echo '12; 42,31;100' | grep -oP '\d+(?=;)'
    12
    31

[0] https://learnbyexample.github.io/learn_gnugrep_ripgrep/intro...
In vim there are `\zs` and `\ze`, where `\zs` is the equivalent of `\K` in grep.

Basically

    :%s/hello \zsworld\ze out there/planet/g

would find all `hello world out there` and replace `world` with `planet`.
Consider this:

    grep -P 'start: (\d+) end'
"How do I make it print only the captured group with the number, not the whole line?" is a pretty common Stack Overflow question. The "\K" thing gets rid of the "start: " part, but what about " end"? That's where "reverse \K" would come in handy.
That's where ripgrep's -r/--replace flag comes in handy:

    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end'
    foobar start: 123 end quuxbar
    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end' -r '$1'
    foobar 123 quuxbar
    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end' -or '$1'
    123
That's where lookarounds help:

    grep -oP 'start: \K\d+(?= end)'
`\K` is kinda similar to lookbehind (but not exactly the same, as it is not zero-width), and is particularly helpful for variable-length patterns.

If you need to process further, you can make use of the `-r` option in `ripgrep` or move to other tools like sed, awk, perl, etc.
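
To illustrate the variable-length point, here is a sketch with made-up input where the amount of whitespace after `start:` differs per line. Lookbehind in PCRE has traditionally been limited to patterns of known length, whereas the part before `\K` has no such restriction:

```shell
# Made-up input: the gap after "start:" varies in width.
input='start: 123 end
start:     456 end'

# \s+ is variable length, which is fine before \K;
# (?= end) keeps the trailing " end" out of the reported match.
numbers=$(printf '%s\n' "$input" | grep -oP 'start:\s+\K\d+(?= end)')
echo "$numbers"
```

This should print `123` and `456` on separate lines.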

> all the characters after the first double quote and until the last double quote

Until the next double quote, not necessarily the last one.
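
A quick way to see the difference, using a made-up line with more than one quoted string (the second pattern is only there for contrast and is not from the original post):

```shell
# Made-up line with more than one quoted string.
line='name: "foo" and "bar"'

# [^"]+ stops at the next double quote.
next_quote=$(printf '%s\n' "$line" | grep -oP 'name:\s*"\K[^"]+')

# For contrast, a greedy .* with a lookahead runs up to the last double quote.
last_quote=$(printf '%s\n' "$line" | grep -oP 'name:\s*"\K.*(?=")')

echo "$next_quote"   # foo
echo "$last_quote"   # foo" and "bar
```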