| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by DevelopingElk 125 days ago

The issue with Regex for parsing is it can't handle balanced parentheses. https://en.wikipedia.org/wiki/Regular_expression. More generally, they can't handle nested structure. Context free grammars are the most natural extension that can. It adds a substitution operator to Regex that makes it powerful enough to recognize nested structure. So, Regex would be reinvented if history was rerun, but so would Context Free Grammars. Part of the complexity in parsing is attaching semantic meaning to the parse. Regex mostly avoids this by not caring how a string matches, just if it matches or not.

Now, I do agree that LR grammars are messy. Nowadays, they have mostly fallen from favor. Instead, people use simpler parsers that work for the restricted grammars actual programming languages have.

IIRC there is some research into formalizing the type of unambiguous grammar that always uses () or [] as nesting elements, but can use Regex for lexing.

1 comments

ogogmad 125 days ago

I understand what a CFG is and why Dyck's language (matching parens) is not a regular language. My point was that CFG/CFL is less motivated by a reasonable and uniquely characterising constraint - such as making memory usage independent of the size of an input string - than regex is.

Then again, you are right that CFGs are very natural. And they do admit a few easy O(n^3) parsing algorithms, like Earley and CYK.

I think your last sentence relates to Visible Pushdown Grammars. See also Operator Precedence Grammars.

link