by repsilat 2900 days ago
> A more understandable way to process such a string would be to split it into constituent parts and use regex only for validation

Split the regex, or split the URL itself? I kinda think the split regex is pretty reasonable:

    protocol = "[a-z]{3,10}://"
    domain = "([^/?#]*)"
    path = "([^?#]*)"
    query = "(?:\?([^#]*))?"
    fragment = "(?:#(.*))?"
    url = protocol + domain + path + query + fragment

It's not so terse now, though, and it probably has to be wrapped in a function (or stored as a constant somewhere else).
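
For what it's worth, here is a minimal sketch of that split regex as runnable Python. The named groups, the `URL_RE` constant and the `split_url` helper are my own additions for illustration (the snippet above only has unnamed groups and does not capture the protocol), so treat it as one possible way to wrap it, not the definitive one.

```python
import re

# Each piece is a raw string so backslashes reach the regex engine untouched.
# The (?P<name>...) groups are an assumption added for readability.
PROTOCOL = r"(?P<protocol>[a-z]{3,10})://"
DOMAIN = r"(?P<domain>[^/?#]*)"
PATH = r"(?P<path>[^?#]*)"
QUERY = r"(?:\?(?P<query>[^#]*))?"
FRAGMENT = r"(?:#(?P<fragment>.*))?"

# Stored once as a module-level constant.
URL_RE = re.compile(PROTOCOL + DOMAIN + PATH + QUERY + FRAGMENT)


def split_url(url):
    """Return the named parts of url as a dict, or None if it does not match."""
    match = URL_RE.fullmatch(url)
    return match.groupdict() if match else None


parts = split_url("http://perl6.org/foo/bar/baz/?a=1&b=2#fragment")
```

Parts that are absent from the URL (query, fragment) come back as `None` in the dict.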

I like the way Perl 6 handles this with the grammar feature. (A grammar is just a special type of class, with a regex as just a special type of method.)

It could be simpler, but I want the resulting data structure to be easier to use.

  grammar Url {
  
    # default regex/token/rule/method to call
    # (token disables backtracking)
    token TOP {
      <protocol> <domain> <path> <query> <fragment>
    }
  
    token protocol {
      <(
  
        <[a..z]> ** 3..10
  
      )>     # don't include :// in the stringified result
  
      '://'  # must be quoted as it isn't alphanumeric
    }
  
    token domain-segment {  <-[?#/.]>+  }
    token domain {
      <domain-segment> ** 2..* # at least 2 domain segments
        % '.'                  # separated by .
  
      <?{
        # make sure that the last segment is at least 3 chars
        # (using the Boolean result of regular Perl 6 code)
        @<domain-segment>.tail.chars >= 3
      }>
    }
  
    token path-segment {  <-[?#/\\]>+  }
    token path {
      [
        <[/\\]>
        <path-segment>*
          %% <[/\\]>     # separated by path separator (allow trailing)
      ]?
    }
  
    token query-segment {
      # store as named, rather than positional
      $<key>   = ( <-[#=&]>+ )
      '='
      $<value> = ( <-[#=&]>+ )

      # run regular Perl 6 code in the regex
      {

        # attach a Pair object as the AST
        make ~$<key> => val(~$<value>)
        # (`val` turns a numeric value into an allomorph)

      }
    }
    token query {
      [
        '?'
        <( # don't include ? in the stringified result

          <query-segment>*
            % '&'         # separated by & (no trailing allowed)

        )>
      ]?
  
      {
        # attach a static associative array of the key value pairs
        # as the AST
        make Map.new: (@<query-segment>».ast if @<query-segment>.elems)
      }
    }
  
    token fragment {
      [
        '#'
         <(  .*  )> # don't include '#' in the stringified result
      ]?
    }
  }
Example usage:

  > my $result = Url.parse('http://perl6.org/foo/bar/baz/?a=1&b=2#fragment');
  > say $result;
  「http://perl6.org/foo/bar/baz/?a=1&b=2#fragment」
   protocol => 「http」
   domain => 「perl6.org」
    domain-segment => 「perl6」
    domain-segment => 「org」
   path => 「/foo/bar/baz/」
    path-segment => 「foo」
    path-segment => 「bar」
    path-segment => 「baz」
   query => 「a=1&b=2」
    query-segment => 「a=1」
     key => 「a」
     value => 「1」
    query-segment => 「b=2」
     key => 「b」
     value => 「2」
   fragment => 「fragment」

  > say $result<query>.ast;
  Map.new((:a(IntStr.new(1, "1")),:b(IntStr.new(2, "2"))))

  > my %query := $result<query>.ast;
  > say %query<b> ~~ Int; # True (because of val(…))
  True
A more advanced usage would be with an actions class.

Basically Perl 6 treats regular expressions as code that is written in a domain specific sub-language, with grammars acting as a structure to hang them off of.

That will certainly work, however for large regular expressions it will become just as unmanageable over time, especially since the concatenation of each part depends on all the previous ones being error free. I was referring to the idea of breaking apart the work done by the regex, into more manageable parts.

One of the complaints is that a regex is too terse. This is because a regex provides you with no internal context of what you're trying to do. You can add context by leveraging additional regex features, but that may potentially make the expression even more difficult for a human to parse. The alternative is to use the regexes more sparingly, and allow whatever the host language is to provide the context. Just because you are able to parse a whole string and capture each part that matches some particular pattern all in one go doesn't mean that it's a good idea. Consider the following quick piece of pseudocode where the URL is split into smaller pieces first, which does a similar job to the regex above:

    protocol, domain, path, query, fragment = explode(url, "<protocol>://<domain>/<path>[?<query>][#<fragment>]")

    if (protocol !~ "[a-z]{3,10}")
        error("invalid protocol")

    if (domain !~ "[a-z]+(\.[a-z]+)*")
        error("invalid domain")

    if (path !~ "[a-z]+(\/[a-z]+)*")
        error("invalid path")

    if (query not nil and query !~ "[a-z]+")
        error("invalid query")

    if (fragment not nil and fragment !~ "[a-z]+")
        error("invalid fragment")

    success("valid URL!")

Granted, this is much longer than the regex-only solution, and the magical 'explode' function I've imagined here will take a bit more effort to implement. However, the regexes themselves are now simpler, easier to evaluate, and we have context on what they're there for. Ultimately, we've just stopped using 500 lines' worth of features in the regex library in favour of 500 lines of code that does the same thing elsewhere, but have arguably made it all much more understandable.
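
To make the pseudocode concrete, here is a rough Python sketch of the same idea. The `explode` here is a hypothetical stand-in built on `str.partition` rather than a pattern language, and `validate` returns a message instead of calling `error`; both are assumptions of mine, not a real library API.

```python
import re


def explode(url):
    """Split url into (protocol, domain, path, query, fragment) without regexes.

    query and fragment are None when absent.
    Raises ValueError if '://' is missing.
    """
    protocol, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("missing '://'")
    # Peel the optional parts off from the right-hand side first.
    rest, has_fragment, fragment = rest.partition("#")
    rest, has_query, query = rest.partition("?")
    domain, _, path = rest.partition("/")
    return (
        protocol,
        domain,
        path,
        query if has_query else None,
        fragment if has_fragment else None,
    )


def validate(url):
    """Return None if url is valid, otherwise a message naming the bad part."""
    try:
        protocol, domain, path, query, fragment = explode(url)
    except ValueError as exc:
        return str(exc)
    if not re.fullmatch(r"[a-z]{3,10}", protocol):
        return "invalid protocol"
    if not re.fullmatch(r"[a-z]+(\.[a-z]+)*", domain):
        return "invalid domain"
    if not re.fullmatch(r"[a-z]+(/[a-z]+)*", path):
        return "invalid path"
    if query is not None and not re.fullmatch(r"[a-z]+", query):
        return "invalid query"
    if fragment is not None and not re.fullmatch(r"[a-z]+", fragment):
        return "invalid fragment"
    return None
```

Each regex now checks one small thing, and the error tells you which part failed. As in the pseudocode, the patterns are deliberately strict (lowercase letters only, and a non-empty path is required), so plenty of real URLs would be rejected.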

You'll notice that I've changed the regular expressions for each of the components of the URL. This is because the original regex essentially only provides the same functionality as the explode() function above, with no validation of the contents of each part. Consider this another argument against using regular expressions for this kind of work. Note, however, that the pattern language passed to the explode() function itself appears to be regular. This is not unexpected - since URIs are defined using BNF and therefore are either regular or context-free - however it is an example of a scenario where a regular-expression-like language does not have to be cryptic.

The way Perl 6 makes this more manageable is with grammars. (See previous post for an example)

Since a grammar is just a special type of class you can put regexes into roles and compose them together. You can even inherit from another grammar if you only need to change parts of it.
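
For readers who don't know Perl 6, a loose analogy in Python may help. This is not how grammars work internally; it is only a sketch of the "override one rule, inherit the rest" idea, and the `UrlPattern` and `HttpUrlPattern` classes are hypothetical names made up for the example.

```python
import re


class UrlPattern:
    """Named regex fragments; subclasses override only the parts they change."""

    protocol = r"[a-z]{3,10}"
    domain = r"[^/?#]*"
    path = r"[^?#]*"

    @classmethod
    def compiled(cls):
        # Assemble the fragments at call time so subclass overrides apply.
        return re.compile(
            rf"(?P<protocol>{cls.protocol})://"
            rf"(?P<domain>{cls.domain})"
            rf"(?P<path>{cls.path})"
        )


class HttpUrlPattern(UrlPattern):
    # Only the protocol rule changes; domain and path are inherited.
    protocol = r"https?"


generic = UrlPattern.compiled()
http_only = HttpUrlPattern.compiled()
```

Here `generic` accepts `ftp://example.com/pub` while `http_only` rejects it, even though the subclass only redefines one fragment.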

Also, if something is difficult to express in the regex sub-language, you can use regular Perl 6 code inline. (A regex is just a special type of method with a domain-specific sub-language.)

And if there is a bug, you can use Grammar::Debugger or Grammar::Tracer to help find it. (I had a bug in the earlier post and used Grammar::Tracer to find and fix it within seconds.)

A Perl 6 grammar can look remarkably similar in structure to BNF. https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra... https://trac.ietf.org/trac/json/browser/abnf/json.abnf

Your example is only an example, so this is a minor niggle, but remember you can specify usernames, passwords, and ports in a standard URL too :)
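
As a quick illustration of those extra parts, Python's standard library `urllib.parse.urlsplit` already exposes them. The URL below is made up for the example.

```python
from urllib.parse import urlsplit

# A made-up URL carrying a username, a password and an explicit port.
parts = urlsplit("https://alice:secret@example.com:8443/foo/bar?a=1#frag")

# The authority section is broken down further into these attributes.
credentials = (parts.username, parts.password)
host_and_port = (parts.hostname, parts.port)
```

Note that `parts.port` is an integer, while the other attributes are strings.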