| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tlrobinson 4142 days ago

Properly formatted (to be fair this is from a blog post explaining how the regex works: http://daringfireball.net/2010/07/improved_regex_for_matchin...):

    (?xi)
    \b
    (                           # Capture 1: entire matched URL
      (?:
        [a-z][\w-]+:                # URL protocol and colon
        (?:
          /{1,3}                        # 1-3 slashes
          |                             #   or
          [a-z0-9%]                     # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
        )
        |                           #   or
        www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
        |                           #   or
        [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
      )
      (?:                           # One or more:
        [^\s()<>]+                      # Run of non-space, non-()<>
        |                               #   or
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
      )+
      (?:                           # End with:
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
        |                                   #   or
        [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
      )
    )

1 comments

raiph 4141 days ago

cf the Perl 6 community module for parsing URIs which features Perl 6's unique unification of regexes and grammars:

https://github.com/perl6-community-modules/uri/blob/master/l...

link