| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kaoD 4305 days ago

> But it's not possible to write a CFG for Markdown because of Markdown's requirement that anything is valid input.

Honest question here: how do CFGs prevent you from parsing anything as valid input? E.g. AFAICT this CFG in BNF accepts anything as valid input (including no input!)

    <s> ::= <x> EOF
          | EOF

    <x> ::= CHAR <x>
          | CHAR

Isn't it just a problem of crafting a proper grammar which covers all cases? I can see why HTML needs other parsing strategies, but it should be easy for Markdown since everything that is not a non-paragraph is a paragraph (or a blank line, i.e. a paragraph separator).

Also, isn't there a compromise between HTML's crazy parsing strategy and a CFG? A formal grammar, even if not context-free.

1 comments

roop 4305 days ago

True, we can write a CFG that can accept any input, but not one that can parse Markdown.

Actually, I should have said it's not possible to write an _unambiguous_ CFG for Markdown.

Say we need to parse emphasis in span elements. "_a_" is em and "__a__" is strong, but "_a", "a_", "__a" and "a__" are normal text. If we write the rules for all these, we end up with a grammar than can generate the same string in many different ways. To determine whether an "_" is the syntax qualifier of an em or just part of normal text, we might have to look ahead an arbitrary number of characters, and potentially till the end of the input. This is why it's not possible to write a useful (or unambiguous) CFG for Markdown, and this is because of the requirement to not throw an error on any input.

> Also, isn't there a compromise between HTML's crazy > parsing strategy and a CFG?

PEGs have been written for Markdown and they work because PEGs are inherently unambiguous, but use backtracking instead. But those PEGs don't handle nested blocks cleanly.

My own HTML5-ish Markdown spec (http://www.vfmd.org/vfmd-spec/specification/) is not as crazy as HTML5's, but admittedly, is not trivial to implement either.

link

kaoD 4304 days ago

Interesting reply.

> But those PEGs don't handle nested blocks cleanly.

What's the problem exactly?

link

roop 4304 days ago

With PEGs, it's not possible to express nested content, so we have to collate for example the blockquoted content of blockquotes and process them separately. Per my understanding (and I might be wrong here), part of the problem is that PEG includes tokenization, so it's not possible to handle blockquote-levels in the tokeniser separately.

More details on PEG for Markdown: - from the author of the PEG grammar, who's also the spec-writer of "Common Markdown": http://talk.standardmarkdown.com/t/standard-markdown-formal-... - from myself: http://www.vfmd.org/introduction/#prior-work

Also, I wrote an expanded explanation of essentially what I said about CFGs and Markdown: http://roopc.net/posts/2014/markdown-cfg/

link