| HN Mirror



	by shafte 2900 days ago
	To be fair, you are not including more advanced operators, like positive/negative lookahead/lookbehind (which is the specific example the article uses), capturing and non-capturing groups, greedy vs non-greedy Kleene stars, etc. As you say, they are implementation specific, but that's part of the problem: the basic regular expression syntax is insufficient for many tasks, so people take to extending it in complicated and syntactically opaque ways. That's the sign of a bad DSL, not a good one.
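For readers who have not met these extensions, here is a minimal sketch of how they look in one particular flavour, Python's `re` module. The sample strings and variable names are made up for illustration, and other engines may spell or support these features differently.

```python
import re

# Positive lookahead: match "foo" only when "bar" follows; "bar" is not consumed.
lookahead = re.findall(r"foo(?=bar)", "foobar foobaz")

# Negative lookbehind: a run of digits that is not preceded by "$".
lookbehind = re.findall(r"(?<!\$)\b\d+", "$10 and 20")

# Greedy vs non-greedy star: ".*" grabs as much as it can, ".*?" as little.
greedy = re.findall(r"<.*>", "<a><b>")
lazy = re.findall(r"<.*?>", "<a><b>")

# Capturing vs non-capturing groups: "(?:...)" groups without capturing.
capturing = re.match(r"(ab)+(c)", "ababc").groups()
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

print(lookahead, lookbehind, greedy, lazy, capturing, non_capturing)
```

Each feature is only a few extra characters, which is arguably the opacity being complained about here.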


jmts 2900 days ago

Maybe I'm arguing semantics here. To be clear, my point is that I do not agree it is reasonable to declare that regular expressions are a bad DSL simply because it is possible (however common) for people to write difficult to read, or difficult to understand regular expressions. It is the responsibility of the author of the expression to ensure that it is readable and understandable - to the extent that they should exercise restraint when possible use of an available feature would hinder readability and understandability.

There is absolutely no need for the example regex of http-like strings to be written the way that it is - there is only the want of the author, because they have a hammer and they are looking for a nail. If anything, using a regex for such a thing sets a bad precedent because anybody who wishes to come along and add user:password@ support to it is going to extend it and make it worse.

A more understandable way to process such a string would be to split it into constituent parts and use regex only for validation. Split at the :// for the scheme, split at the next / for the path, etc. Turn these into functions, and keep the regexes simple.

Regular expressions are notorious because they are abused, not because they are evil.


repsilat 2900 days ago

> A more understandable way to process such a string would be to split it into constituent parts and use regex only for validation

Split the regex, or split the URL itself? I kinda think the split regex is pretty reasonable:

    protocol = "[a-z]{3,10}://"
    domain = "([^/?#]*)"
    path = "([^?#]*)"
    query = "(?:\?([^#]*))?"
    fragment = "(?:#(.*))?"
    url = protocol + domain + path + query + fragment

Not so terse now, though; it probably has to be wrapped in a function (or stored as a constant somewhere else).
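As a rough sketch of what "wrapped in a function" could look like, here are the same pieces composed in Python. The `URL_RE` constant and the `split_url` helper are hypothetical names, and the capture group around the protocol is an addition that is not in the snippet above.

```python
import re

# The pieces from the comment above, as raw strings so the backslash
# in the query pattern reaches the regex engine unchanged.
protocol = r"([a-z]{3,10})://"  # assumption: capture group added here
domain = r"([^/?#]*)"
path = r"([^?#]*)"
query = r"(?:\?([^#]*))?"
fragment = r"(?:#(.*))?"

# Compile once and keep as a module-level constant.
URL_RE = re.compile(protocol + domain + path + query + fragment)


def split_url(url):
    """Return (protocol, domain, path, query, fragment), or None if no match.

    A missing query or fragment comes back as None.
    """
    m = URL_RE.fullmatch(url)
    return m.groups() if m else None


print(split_url("http://perl6.org/foo/bar/baz/?a=1&b=2#fragment"))
```

Note that this only splits the string; none of the pieces validate their contents beyond excluding the delimiter characters.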


b2gills 2892 days ago

I like the way Perl 6 handles this with the grammar feature. (A grammar is just a special type of class, with a regex as just a special type of method.)

It could be simpler, but I want the resulting data structure to be easier to use.

  grammar Url {
  
    # default regex/token/rule/method to call
    # (token disables backtracking)
    token TOP {
      <protocol> <domain> <path> <query> <fragment>
    }
  
    token protocol {
      <(
  
        <[a..z]> ** 3..10
  
      )>     # don't include :// in the stringified result
  
      '://'  # must be escaped as it isn't alphanumeric
    }
  
    token domain-segment {  <-[?#/.]>+  }
    token domain {
      <domain-segment> ** 2..* # at least 2 domain segments
        % '.'                  # separated by .
  
      <?{
        # make sure that the last segment is at least 3 chars
        # (using the Boolean result of regular Perl 6 code)
        @<domain-segment>.tail.chars >= 3
      }>
    }
  
    token path-segment {  <-[?#/\\]>+  }
    token path {
      [
        <[/\\]>
        <path-segment>*
          %% <[/\\]>     # separated by path separator (allow trailing)
      ]?
    }
  
    token query-segment {
      # store as named, rather than positional
      $<key>   = ( <-[#=&]>+ )
      '='
      $<value> = ( <-[#=&]>+ )

      # run regular Perl 6 code in the regex
      {

        # attach a Pair object as the AST
        make ~$<key> => val(~$<value>)
        # (`val` turns a numeric value into an allomorph)

      }
    }
    token query {
      [
        '?'
        <( # don't include ? in the stringified result

          <query-segment>*
            % '&'         # separated by & (no trailing allowed)

        )>
      ]?
  
      {
        # attach a static associative array of the key value pairs
        # as the AST
        make Map.new: (@<query-segment>».ast if @<query-segment>.elems)
      }
    }
  
    token fragment {
      [
        '#'
         <(  .*  )> # don't include '#' in the stringified result
      ]?
    }
  }

Example usage:

  > my $result = Url.parse('http://perl6.org/foo/bar/baz/?a=1&b=2#fragment');
  > say $result;
  ｢http://perl6.org/foo/bar/baz/?a=1&b=2#fragment｣
   protocol => ｢http｣
   domain => ｢perl6.org｣
    domain-segment => ｢perl6｣
    domain-segment => ｢org｣
   path => ｢/foo/bar/baz/｣
    path-segment => ｢foo｣
    path-segment => ｢bar｣
    path-segment => ｢baz｣
   query => ｢a=1&b=2｣
    query-segment => ｢a=1｣
     key => ｢a｣
     value => ｢1｣
    query-segment => ｢b=2｣
     key => ｢b｣
     value => ｢2｣
   fragment => ｢fragment｣

  > say $result<query>.ast;
  Map.new((:a(IntStr.new(1, "1")),:b(IntStr.new(2, "2"))))

  > my %query := $result<query>.ast;
  > say %query<b> ~~ Int; # True (because of val(…))
  True

A more advanced usage would be with an actions class.

Basically Perl 6 treats regular expressions as code that is written in a domain specific sub-language, with grammars acting as a structure to hang them off of.


jmts 2899 days ago

That will certainly work; however, for large regular expressions it will become just as unmanageable over time, especially since the concatenated expression depends on every earlier part being error-free. I was referring to the idea of breaking apart the work done by the regex into more manageable parts.

One of the complaints is that a regex is too terse. This is because a regex provides you with no internal context of what you're trying to do. You can add context by leveraging additional regex features, but that may potentially make the expression even more difficult for a human to parse. The alternative is to use the regexes more sparingly, and allow whatever the host language is to provide the context. Just because you are able to parse a whole string and capture each part that matches some particular pattern all in one go doesn't mean that it's a good idea. Consider the following quick piece of pseudocode where the URL is split into smaller pieces first, which does a similar job to the regex above:

    protocol, domain, path, query, fragment = explode(url, "<protocol>://<domain>/<path>[?<query>][#<fragment>]")

    if (protocol !~ "[a-z]{3,10}")
        error("invalid protocol")

    if (domain !~ "[a-z]+(\.[a-z]+)*")
        error("invalid domain")

    if (path !~ "[a-z]+(\/[a-z]+)*")
        error("invalid path")

    if (query not nil and query !~ "[a-z]+")
        error("invalid query")

    if (fragment not nil and fragment !~ "[a-z]+")
        error("invalid fragment")

    success("valid URL!")

Granted, this is much longer relative to the regex-only solution - and it will probably take a bit more effort to implement the magical 'explode' function I've imagined here - however the regexes themselves are now simpler, easier to evaluate, and we have context on what they're there for. Ultimately, we've just stopped using 500 lines worth of features in the regex library in favour of 500 lines of code that does the same thing elsewhere, but arguably have made it all much more understandable.
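The pseudocode above could be fleshed out roughly as follows in Python. This is only a sketch: the `explode` here hard-codes the URL layout with `str.partition` instead of taking a pattern string, the `RULES` table and `validate` function are made-up names, and the deliberately strict `[a-z]` patterns are copied from the pseudocode and are not a real URL validator.

```python
import re


def explode(url):
    """Split url into (protocol, domain, path, query, fragment).

    No regexes are used for the splitting. A missing query or
    fragment is returned as None.
    """
    protocol, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("missing '://'")
    # Strip the fragment first, since it may itself contain "?".
    rest, hash_sep, fragment = rest.partition("#")
    rest, q_sep, query = rest.partition("?")
    domain, _, path = rest.partition("/")
    return (protocol, domain, path,
            query if q_sep else None,
            fragment if hash_sep else None)


# (name, pattern, optional) triples mirroring the checks in the pseudocode.
RULES = [
    ("protocol", r"[a-z]{3,10}", False),
    ("domain", r"[a-z]+(\.[a-z]+)*", False),
    ("path", r"[a-z]+(/[a-z]+)*", False),
    ("query", r"[a-z]+", True),
    ("fragment", r"[a-z]+", True),
]


def validate(url):
    """Return None if every part passes its check, else an error message."""
    try:
        parts = explode(url)
    except ValueError as exc:
        return str(exc)
    for (name, pattern, optional), value in zip(RULES, parts):
        if value is None and optional:
            continue
        if re.fullmatch(pattern, value) is None:
            return "invalid " + name
    return None


print(validate("http://example.com/foo/bar?abc#frag"))
print(validate("http://example.com/foo?a=1"))
```

Each regex now checks exactly one named part, so a failure points at the part that is wrong.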

You'll notice that I've changed the regular expressions for each of the components of the URL. This is because the original regex is essentially only providing the same functionality as the explode() function above, and there is no validation of the contents of each part. Consider this another argument against using regular expressions for this kind of work. Note however, that the language provided to the explode() function itself appears to be regular. This is not unexpected - since URIs are defined using BNF and therefore are either regular or context-free - however it is an example of a scenario where a regular-expression-like language does not have to be cryptic.


b2gills 2892 days ago

The way Perl 6 makes this more manageable is with grammars. (See previous post for an example)

Since a grammar is just a special type of class you can put regexes into roles and compose them together. You can even inherit from another grammar if you only need to change parts of it.

Also if there is something that is difficult to do regularly in the regex sub-language, it allows you to use regular Perl 6 code inline. (A regex is just a special type of method with a domain specific sub-language.)

Also if there is a bug you can use Grammar::Debugger or Grammar::Tracer to help find it. (I had a bug in the earlier post and used Grammar::Tracer to find and fix it within seconds.)

A Perl 6 grammar can look remarkably similar in structure to BNF. https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra... https://trac.ietf.org/trac/json/browser/abnf/json.abnf


stevekemp 2899 days ago

Your example is only an example, so this is a minor niggle, but remember you can specify usernames, passwords, and ports in a standard URL too :)
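To make the point concrete, here is a small sketch using `urlsplit` from Python's standard `urllib.parse` module, which already handles those extra components. The URL and credentials are invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL carrying a username, password and port.
parts = urlsplit("http://user:secret@example.com:8080/foo/bar?a=1#frag")

print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8080 (an int)
print(parts.path)      # /foo/bar
print(parts.query)     # a=1
print(parts.fragment)  # frag
```

Any hand-rolled regex or split function would need to grow to cover the same cases.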


ljw1001 2899 days ago

> I do not agree it is reasonable to declare that regular expressions are a bad DSL simply because it is possible (however common) for people to write difficult to read, or difficult to understand regular expressions.

I would not call RE a bad language - they are simply too useful for that. I would argue, though, that it's a reasonable design criticism to say - not that it's possible to write difficult to read code - but that it is difficult to write easy to read code.
