Hacker News new | ask | show | jobs
by _tef 5581 days ago
I have a couple of loosely connected points and links:

You want to buy 'Parsing Techniques 2nd Ed' -- this book is fantastic and will give you the understanding and clarity you seek. It is wonderfully comprehensive.

And schmitz' thesis is an excellent read on tools and techniques to deal with ambiguity http://www.lsv.ens-cachan.fr/~schmitz/papers

and the non-canonical parser that results is quite interesting too :-)

PEGs vs Boolean Grammars: I think adding negation and conjunction to a grammar is the same as positive and negative lookahead

PEGS + Ambiguity: Like other people, I solved this in my last top down parser by attaching both precedences and relations to terms

i.e E[50] :== E[< 50] + E[<= 50]

so you can force a left or right associative operator.

That said, although determinism is nice, I think ambiguous parses are inevitable once you have a sufficient level error handling within the parser

parsing theory is pretty developed and established but in reality many parsers are cobbled together top down parsers.

you can't use existing parse libraries to handle HTTP for example.

(That said, yakker is an approach with earley parsing to handle data dependent grammars)

2 comments

and while i'm here - the original earley paper is full of bugs.for a modern treatment you may find aycock & horspool's work on 'practical earley parsing' interesting, as well as the work on the Marpa Parser in perl.

to me: the earley parser and the packrat parser are similar in nature - they are both chart parsers. one is depth first, one is breadth first.

my question: can you engineer an earley parser that supports ordered choice and exploits this to avoid broadening the search too early -- ie can earley parsers support PEGs

I think it is possible and when i'm not drowning in my startup work i'd like to take a longer look at it.

fwiw my terrible attempts at an earley recognizer are here http://pastie.org/private/uezi1mxiur8dymdbh2scw

i've tried to use some ideas from the aycock and the aretz variants of the earley parser - avoiding lists of earley items at a character - instead storing the reductions/completions seperately from the scanning items, and using a driver rather than the predictor/completor loop (still need to add nullable support...)

I'm not sure the Earley paper "is full of bugs" - to the best of my knowledge, it has only one bug (albeit a biggy), which is in its description of how to create a parse tree.
it isn't cubic time either iirc
Yes, that's part of the same bug, as I understand it anyway (to be clear: I'm not sure that anyone's claiming that there are problems with Earley's recogniser, which is really the core of his paper). Scott's "SPPF-Style Parsing From Earley Recognisers" gives an overview of the bug and its implications.
I can't remember off hand if it dealt with nullable terms or hidden left recursion properly either.

don't get me wrong: I like the earley parser :-) I just think the original paper has some omissions and the treatment of earley parsing in 'parsing techniques 2nd ed' is thorough and includes substantial references and explanations of further work.

isn't the mongrel/thin/unicorn http parser library built using ragel (i.e. an existing parse library)? Could you expand on why you think it's not possible to handle http with existing tools?
ragel is for writing state machines and automata in.

many parsers are written as automata, but that does mean it is in the same category as parsing tools such as LR, LL, GLR, PEG or CFG based approaches.

yes, but the generated state machine is able to recognize a regular language, so if the automaton can recognize http it would mean that it is a regular language, and existing parsing tools should be able to deal with it, or am I missing something?
regular expressions (ala cs) are equiv to finite state machines. regular expressions can't count or match ()'s. ragel allows you to mix in code within the state machine, so it is actually far, far more powerful than a finite state machine.

in http, handling things like 'Content-Length: %d' and then reading a subsequent length is a little harder. as is handling transfer-chunked encoding. http is quite fiendish in places :-)

these are 'data dependent' - the parse stream contains information (i.e length) on the subsequent tokens - although some regexps have back references, these aren't common place in parsing tools/formalisms like LR,LL,LALR,CFG or PEGs.

my point is simply that a lot of the parsing drama of late has revolved around the simple task of parsing a language, rather than parsing network protocols.

there is a larger class of parsing problems that are still to be tackled.

Ah, now I get it what you meant, thanks for the explanation