Hacker News new | ask | show | jobs
by BruceIV 4663 days ago
It sounds to me like he wants a PEG parser - PEG is a formalization of a recursive descent parsers that operates in a single pass and can quite easily handle the "parse different things in different contexts" problem (like the inline assembly the author mentions).

It's not terribly mature yet, but I've been poking away at building a PEG parser generator myself (mostly as an exercise to learn C++11): https://github.com/bruceiv/egg

1 comments

PEG parsers use backtracking and do not operate in a single pass. They use memoization to avoid the combinatorial explosion, but that trades one problem for another (though of lesser magnitude).

Recursive descent parsers are normally LL(k) and are O(n) in the size of the input - choices are made based on the next k tokens and no backtracking or alternate grammar paths are generally ever tried. Though they are not particularly efficient when rules nest deeply on the left before consuming tokens. It's better to use an operator precedence parser for parsing arithmetic expressions than use LL(1) rules to handle precedence, for example.

Having a separate lexer has other benefits besides eliminating explicit whitespace. It enforces consistency in the token set, which can otherwise seem arbitrary - a keyword has a special meaning in one part of the grammar, but not elsewhere. And this in turn helps with keeping backward compatibility when growing a language, because you can know for a fact that using a keyword elsewhere in the grammar doesn't introduce new problems.

PEG parsers use backtracking and do not operate in a single pass.

Backtracking + memoization is but one implementation technique for PEG parsers. Another such technique is akin to dynamic programming -- fill in a table with possible parsing outcomes as you go. No "backtracking" (which really, is moot in a memoized PEG parser) required.

Either way, the performance is identical to that of a recursive descent parser (linear in the size of the input and number of nonterminals).

Recursive descent parsers are linear in the size of the input, but they're also linear in the nesting depth of the grammar, unlike e.g. PDAs or operator precedence parsers, which is why they're a poor choice for things like math expressions.

And performance is not just time, but also space.

There are niches for most parsing algorithms, but PEGs are a poor fit for most tasks except lightweight ad-hoc use, especially implementation languages for which tool support is poor.

Perhaps, but it also makes it annoying to add new keywords while keeping backwards compatibility even if it can be done entirely unambiguously. For example, C++11 needed to resort to the awkward idea of "identifiers with special meaning" to add override and final.
By "single pass" I meant simply that you don't need a separate lexing phase (though as some of the other commenters have mentioned, it is possible to do a PEG parser in linear time using memoization or dynamic programming)