HN Mirror

by tolmasky 589 days ago

I implemented something similar to the compositional regular expressions feature described here for JavaScript a while ago (independently, so semantics may not be the same), and it is one of the libraries I find myself most often bringing into other projects years later. It gets you a tiny bit closer to feeling like you have a first-class parser in the language. Here is an example of implementing media type parsing with regexes using it: https://runkit.com/tolmasky/media-type-parsing-with-template...

"templated-regular-expression" on npm, GitHub: https://github.com/tolmasky/templated-regular-expression

To be clear, programming languages should just have actual parsers and you shouldn't use regular expressions for parsers. But if you ARE going to use a regular expression, man is it nice to break it up into smaller pieces.
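
As a rough illustration of the "smaller pieces" idea, here is a minimal Python sketch that builds one regex from named fragments. The `PIECES` table and the `expand` helper are hypothetical names made up for this example; they are not the API of templated-regular-expression, and the token pattern is a simplification rather than the full media type grammar.

```python
import re

# Hypothetical building blocks. Each entry may refer to other entries
# with {name}; the patterns are simplified for illustration.
PIECES = {
    "token": r"[A-Za-z0-9!#$%&'*+.^_`|~-]+",
    "type": r"(?P<type>{token})",
    "subtype": r"(?P<subtype>{token})",
    "media_type": r"{type}/{subtype}",
}

def expand(name: str) -> str:
    """Recursively replace {piece} references with their definitions."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: f"(?:{expand(m.group(1))})",
        PIECES[name],
    )

MEDIA_TYPE = re.compile(expand("media_type"))

match = MEDIA_TYPE.fullmatch("text/html")
print(match.group("type"), match.group("subtype"))  # text html
```

Each fragment stays short enough to read on its own, and the final pattern is only assembled once, at compile time.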

b2gills 588 days ago

"Actual parsers" aren't powerful enough to be used to parse Raku.

Raku regular expressions combined with grammars are far more powerful and, if written well, easier to understand than any "actual parser". In order to parse Raku with an "actual parser", it would have to allow you to add and remove things from it as it is parsing. Raku's "parser" does this by subclassing the current grammar, adding or removing things in the subclass, and then reverting to the previous grammar at the end of the current lexical scope.

In Raku, a regular expression is another syntax for writing code. It just has a slightly different default syntax and behavior. It can have both parameters and variables. If the regular expression syntax isn't a good fit for what you are trying to do, you can embed regular Raku syntax to do whatever you need to do and return right back to regular expression syntax.

It also has a much better syntax for doing advanced things, as it was completely redesigned from first principles.

The following is an example of how to match at least one `A` followed by exactly that number of `B`s and exactly that number of `C`s.

(Note that bare square brackets [] are for grouping, not for character classes.)

  my $string = 'AAABBBCCC';

  say $string ~~ /
    ^

    # match at least one A
    # store the result in a named sub-entry
    $<A> = [ A+ ]

    {} # update result object

    # create a lexical var named $repetition
    :my $repetition = $<A>.chars(); # <- embedded Raku syntax

    # match B and then C exactly $repetition times
    $<B> = [ B ** {$repetition} ]
    $<C> = [ C ** {$repetition} ]
  
    $
  /;

Result:

  ｢AAABBBCCC｣
  A => ｢AAA｣
  B => ｢BBB｣
  C => ｢CCC｣

The result is actually a very extensive object that has many ways to interrogate it. What you see above is just a built-in human readable view of it.

In most regular expression syntaxes, to match equal numbers of `A`s and `B`s you would need to recurse in between `A` and `B`. That of course wouldn't allow you to also do that for `C`. It also wouldn't be anywhere near as easy to follow as the above. The above should run fairly fast because it never has to backtrack or recurse.
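
For comparison, in a regex dialect without embedded code the same idea can be approximated in two steps: match the `A`s, then build the rest of the pattern from the count. The following is a minimal Python sketch; `match_abc` is a hypothetical helper written for this example, not a translation of how Raku evaluates the regex above.

```python
import re

def match_abc(text: str) -> bool:
    """Return True if text is n A's, then n B's, then n C's (n >= 1)."""
    head = re.match(r"A+", text)
    if head is None:
        return False
    n = head.end()  # number of A's matched at the start
    # Build the remainder from the count, similar in spirit to
    # `B ** {$repetition}` in the Raku example.
    tail = re.compile(rf"B{{{n}}}C{{{n}}}")
    # fullmatch with a start position requires the tail to reach the end.
    return tail.fullmatch(text, head.end()) is not None

print(match_abc("AAABBBCCC"))  # True
print(match_abc("AAABBCCC"))   # False
```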

When you combine them into a grammar, you will get a full parse-tree. (Actually you can do that without a grammar, it is just easier with one.)

To see an actual parser, I often recommend people look at JSON::Tiny::Grammar https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra...

Frankly, from my perspective, much of the design of "actual parsers" is a byproduct of limited RAM on early computers. The reason there is a separate tokenization stage was to reduce the amount of RAM used for the source code, so that further stages had enough RAM to do the semantic analysis and eventual compiling of the code. It doesn't really do that much to simplify any of the further stages, in my view.

The JSON::Tiny module from above creates the native Raku data structure using an actions class, as the grammar is parsing. Meaning it is parsing and compiling as it goes.
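
The "build the data structure while parsing" idea can be sketched outside Raku as well. The following hand-written Python parser for nested lists of integers is an illustration only: it is not the JSON::Tiny grammar or its actions class, and all function names are hypothetical.

```python
import re

NUMBER = re.compile(r"\s*(-?\d+)")

def skip_ws(text: str, pos: int) -> int:
    """Advance pos past any whitespace."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

def peek(text: str, pos: int) -> str:
    """Return the next non-whitespace character, or "" at end of input."""
    pos = skip_ws(text, pos)
    return text[pos] if pos < len(text) else ""

def expect(text: str, pos: int, char: str) -> int:
    """Consume char (after whitespace) or raise ValueError."""
    pos = skip_ws(text, pos)
    if not text.startswith(char, pos):
        raise ValueError(f"expected {char!r} at position {pos}")
    return pos + 1

def parse_value(text: str, pos: int = 0):
    """Parse an integer or a bracketed list; return (value, next position)."""
    number = NUMBER.match(text, pos)
    if number:
        # "Action": convert the matched text to a native int right away.
        return int(number.group(1)), number.end()
    pos = expect(text, pos, "[")
    items = []
    if peek(text, pos) != "]":
        while True:
            item, pos = parse_value(text, pos)
            items.append(item)  # "Action": grow the list while parsing.
            if peek(text, pos) != ",":
                break
            pos = expect(text, pos, ",")
    return items, expect(text, pos, "]")

value, end = parse_value("[1, [2, 3], []]")
print(value)  # [1, [2, 3], []]
```

There is no separate tokenization pass and no intermediate parse tree here; each rule returns the finished native value as soon as it has matched.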

ogogmad 588 days ago

I imagine this could be understood as making use of a monad. Right?

The main problem with generalised regexes is that you can't match them in linear time worst-case. I'm wondering if this is addressed at all by Raku.

jerf 588 days ago

A "monad" is not really a "thing" you can make use of, because a monad is a type of thing. Think "iterator"; an iterator is not a thing itself, it is a type of thing that things can be.

There is probably a monad you could understand this as being, a specific one, but "monad" itself is not a way to understand it.

And just as you can understand any given Iterator by simply understanding it directly, whatever "monad" you might use to understand this process can be simply understood directly without reference to the "monad" concept.

antononcube 588 days ago

> I imagine this could be understood as making use of a monad. Right?

Can you clarify what you mean?

Do you expect the concept of "monad" to help explain Raku grammars?

ogogmad 588 days ago

Yes. Compare it to the List monad or Parsec.

antononcube 588 days ago

- There is a natural conversion, in both directions, between the Functional Parsers (FP) monad (as in Parsec) and Extended Backus-Naur Form (EBNF).

- Similarly, EBNF can be applied to Raku grammars.

- Hence, representing Raku grammars in the FP monad is doable, at least for a certain large enough set of Raku grammars.

  - See the package "FunctionalParsers".
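
To make the monad comparison concrete, here is a toy "list of successes" parser in Python. It is a sketch under stated assumptions: `unit`, `bind`, `many1`, and `exactly` are hypothetical names for this example and are not the API of Parsec or of the "FunctionalParsers" package. The point is that `bind` lets a later parser depend on an earlier result, which is the role `$repetition` plays in the Raku regex earlier in the thread.

```python
from typing import Callable, List, Tuple

# A parser takes (text, position) and returns every (value, next position).
Parser = Callable[[str, int], List[Tuple[object, int]]]

def unit(value) -> Parser:
    """Succeed without consuming input."""
    return lambda text, pos: [(value, pos)]

def bind(parser: Parser, make_next: Callable[[object], Parser]) -> Parser:
    """Run parser, then use each result to choose the parser that follows."""
    def run(text, pos):
        return [out
                for value, mid in parser(text, pos)
                for out in make_next(value)(text, mid)]
    return run

def many1(char: str) -> Parser:
    """Match one or more copies of char greedily; the value is the count."""
    def run(text, pos):
        end = pos
        while text.startswith(char, end):
            end += 1
        return [(end - pos, end)] if end > pos else []
    return run

def exactly(char: str, n: int) -> Parser:
    """Match exactly n copies of char."""
    def run(text, pos):
        end = pos + n
        return [(n, end)] if text[pos:end] == char * n else []
    return run

# One or more A's, then the same number of B's and of C's.
abc = bind(many1("A"), lambda n:
      bind(exactly("B", n), lambda _:
      bind(exactly("C", n), lambda _: unit(n))))

print(abc("AAABBBCCC", 0))  # [(3, 9)]
print(abc("AABBBCC", 0))    # []
```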

xigoi 588 days ago

Why are they called regular expressions if they can parse non-regular languages?

db48x 587 days ago

It’s gradually got so. <https://youtu.be/JIlpjJnc6qY?t=54>

Literally, Larry Wall was adding things to regexes all the way back before the release of Perl 2.

donaldihunter 588 days ago

The word 'regular' comes from the mathematical roots of automata and finite state machines.

moomin 587 days ago

Which have a one-to-one correspondence with regular languages, so this isn’t actually an answer to the question.

tolmasky 588 days ago

I don't think we disagree here. To clarify, my statement about using "actual parsers" over regexes was more directed at my own library than Raku. Since I had just posted a link on how to "parse" media types using my library, I wanted to immediately follow that with a word of caution: "But don't do that! You shouldn't be using (traditional) regexes to parse! They are the wrong tool for that. How unfortunate it is that most languages have a super simple syntax for (traditional/PCRE) regexes and not for parsing." I had seen in the article that Raku had some sort of "grammar" concept, so I was kind of saying "oh it looks like Raku may be tackling that too."

Hopefully that clarifies that I was not necessarily making any statement about whether or not to use Raku regexes, which I don't pretend to know well enough to qualify to give advice around. Just for the sake of interesting discussion however, I do have a few follow up comments to what you wrote:

1. Aside from my original confusing use of the term "regexes" to actually mean "PCRE-style regexes", I recognize I also left a fair amount of ambiguity by referring to "actual parsers". Given that there is no "true" requirement to be a parser, what I was attempting to say is something along the lines of: a tool designed to transform text into some sort of structured data, as opposed to a tool designed to match patterns. Again, from this alone, seems like Raku regexes qualify just fine.

2. That being said, I do have a separate issue with using regexes for anything, which is that I do not think it is trivial to reason about the performance characteristics of regexes. IOW, the syntax "doesn't scale". This has already been discussed plenty of course, but suffice it to say that backtracking has proven undeniably popular, and so it seems an essential part of what most people consider regexes. Unfortunately this can lead to surprises when long strings are passed in later. Relatedly, I think regexes are just difficult to understand in general (for most people). No one seems to actually know them all that well. They venture very close to "write-only languages". Then people are scared to ever make a change in them. All of this arguably is a result of the original point that regexes are optimized for quick and dirty string matching, not to power gcc's C parser. This is all of course exacerbated by the truly terrible ergonomics, including not being able to compose regexes out of the box, etc. Again, I think you make a case here that Raku is attempting to "elevate" the regex to solve some if not all of these problems (clearly not only composable but also "modular", as well as being able to control backtracking, etc.) All great things!

I'd still be apprehensive about the regex "atoms" since I do think that regexes are not super intuitive for most people. But perhaps I've reversed cause and effect and the reason they're not intuitive is because of the state they currently exist in in most languages, and if you could write them with Raku's advanced features, regexes would be no more unintuitive than any other language feature, since you aren't forced to create one long unterminated 500-character regex for anything interesting. In other words, perhaps the "confusing" aspects of regexes are much more incidental to their "API" vs. an essential consequence of the way they describe and match text.

3. I'd like to just separately point out that many of the things you mentioned as having been added to regexes could be added to other kinds of parsers as well. IOW, "actual parsers" could theoretically parse Raku, if said "actual parsers" supported the discussed extensions. For example, there's no reason PEG parsers couldn't allow you to fall into dynamic sub-languages. Perhaps you did not mean to imply that this couldn't be the case, but I just wanted to make sure to point out that these extensions you mention appear to be much more generally applicable than they are perhaps given credit for by being "a part of regexes in Raku" (or maybe that's not the case at all and it was just presented this way in this comment for brevity, totally possible since I don't know Raku).
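
The "surprises when long strings are passed in later" from point 2 can be illustrated with a toy model. The Python sketch below counts the calls a naive backtracking matcher makes for the pattern `(a|aa)*b`; `count_steps` is a hypothetical helper, and real regex engines differ in their details and optimizations, so this only shows the shape of the problem.

```python
def count_steps(text: str) -> int:
    """Count calls made by a naive backtracking match of (a|aa)*b on text."""
    steps = 0

    def match_from(pos: int) -> bool:
        nonlocal steps
        steps += 1
        # Try each alternative inside the group, then loop again.
        if text.startswith("aa", pos) and match_from(pos + 2):
            return True
        if text.startswith("a", pos) and match_from(pos + 1):
            return True
        # Leave the loop: the remaining input must be exactly "b".
        return text[pos:] == "b"

    match_from(0)
    return steps

print(count_steps("a" * 10))  # 232
print(count_steps("a" * 20))  # 28656
```

On input that almost matches (only `a`s, no final `b`), the matcher retries every way of splitting the run into `a` and `aa`, so the work grows exponentially with the length even though the pattern looks harmless.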

I'll certainly take a closer look at the full Raku grammar stuff, since I've written lots of parser extensions and I'd be curious whether they have analogues in Raku or might make sense to add to it, or alternatively whether there are other interesting ideas that can be taken from Raku. I will say that RakuAST is something I've always wanted languages to have, so that alone is very exciting!
