| The problem is that there's no formal grammar and the spec of "Standard Markdown", while being more specific than John Gruber's, is still full of ambiguities. Some examples of ambiguities: 1. It does not specify precedence. For example, if a line like "~~~" (or "[ref]: /url") is followed by a setext underline, is that a header, or is that the start of a fenced code block (or ref definition)? 2. The spec says: "Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks". It says as an example that "<a href="`">`" is an HTML tag. What happens for different placement of backticks, like "<a `href=""`>" or even "`<a href="">`" is left unspecified. 3. What is the precedence or associativity of span-level constructs? For example, does "<asterisk>a[b<asterisk>](url)" result in "a[b" being emphasised or "b<asterisk>" being linked? Thing is, a specification-by-example like this would have to keep an ever-growing list of corner cases and give examples for each of them. To be completely unambiguous, the list needs to be very long, and when it gets very long, it becomes unwieldy to handle for an implementer of the spec. Hence the need for a formal grammar, which is the shortest way of expressing something unambiguously. But it's not possible to write a CFG for Markdown because of Markdown's requirement that anything is valid input. So the next best thing is to define a parsing algorithm, like the HTML5 spec. (Shameless plug: vfmd (http://www.vfmd.org/) is one such Markdown spec which specifies an unambiguous way to parse Markdown, with tests and a reference implementation.) So if "Standard Markdown" is NOT unambiguous and wouldn't be, then it's not a "standard", so calling it "Standard Markdown" is not quite proper. |
The C and JavaScript implementations use a parsing algorithm that we could have simply translated into English and called a spec. (That's the sort of spec vfmd gives.) But it seemed to us that there was value in giving a declarative specification of the syntax, one that was closer to the way a human reader or writer would think, as opposed to a computer.
Re (3): we have an asterisk which can open emphasis. So, to see if we have emphasis, the rules say to parse inlines sequentially until an asterisk that can close emphasis is reached. The first inline we come to is [b*](url), which is a link. There's no closing asterisk, so we don't have emphasis, but a literal asterisk followed by a link.
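To make the reasoning for (3) concrete, here is a minimal Python sketch of that rule. It is not the reference implementation: the names parse, parse_until and LINK are made up for illustration, the link pattern is deliberately crude, link text is kept raw, and the spec's conditions for when an asterisk "can open" or "can close" emphasis are ignored (every asterisk is treated as a candidate).

```python
import re

# Hypothetical, simplified inline link pattern: [text](url), no nesting.
LINK = re.compile(r'\[([^\]]*)\]\(([^)\s]*)\)')

def add_text(nodes, s):
    """Append literal text, merging with a preceding text node."""
    if nodes and nodes[-1][0] == 'text':
        nodes[-1] = ('text', nodes[-1][1] + s)
    else:
        nodes.append(('text', s))

def parse_until(text, i, closer=None):
    """Consume inlines left to right from text[i:].

    Stops at `closer` if given. Returns (nodes, next_index, closed).
    """
    nodes = []
    while i < len(text):
        ch = text[i]
        if closer and ch == closer:
            return nodes, i + 1, True
        m = LINK.match(text, i) if ch == '[' else None
        if m:
            # A complete link is consumed as one inline, asterisks and all.
            nodes.append(('link', m.group(1), m.group(2)))
            i = m.end()
        elif ch == '*':
            # Try to open emphasis: parse inlines until a closing asterisk.
            inner, j, closed = parse_until(text, i + 1, '*')
            if closed:
                nodes.append(('emph', inner))
                i = j
            else:
                # No closer found, so the asterisk is literal.
                add_text(nodes, '*')
                i += 1
        else:
            add_text(nodes, ch)
            i += 1
    return nodes, i, False

def parse(text):
    return parse_until(text, 0)[0]

print(parse('*a[b*](url)'))  # [('text', '*a'), ('link', 'b*', 'url')]
print(parse('*a* b'))        # [('emph', [('text', 'a')]), ('text', ' b')]
```

In the first call the link swallows the second asterisk before the emphasis scan can see it, so the opening asterisk falls back to literal text.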
Re (1): I believe you are right that the case of a reference definition before a setext header line should be clarified. However, the other case seems clear enough. ~~~ starts a fenced code block, which ends with a closing string of tildes or the end of the enclosing container. The underline would be included in that code block either way.
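A rough Python sketch of the fenced-code case in (1), under heavy simplifying assumptions: only tilde fences, setext headers and one-line paragraphs are recognised, there are no containers, and I assume a closing fence must be at least as long as the opening one. The names parse_blocks, FENCE and UNDERLINE are illustrative only.

```python
import re

FENCE = re.compile(r'^(~{3,})\s*$')       # a line of three or more tildes
UNDERLINE = re.compile(r'^(=+|-+)\s*$')   # setext underline

def parse_blocks(lines):
    """Split lines into ('code', body), ('header', level, text) or ('para', text)."""
    blocks = []
    i = 0
    while i < len(lines):
        m = FENCE.match(lines[i])
        if m:
            # The fence is recognised first, so everything up to the closing
            # fence (or the end of input) is code, underlines included.
            fence = m.group(1)
            body = []
            i += 1
            while i < len(lines):
                c = FENCE.match(lines[i])
                if c and len(c.group(1)) >= len(fence):
                    i += 1
                    break
                body.append(lines[i])
                i += 1
            blocks.append(('code', body))
        elif lines[i].strip() and i + 1 < len(lines) and UNDERLINE.match(lines[i + 1]):
            level = 1 if lines[i + 1].startswith('=') else 2
            blocks.append(('header', level, lines[i]))
            i += 2
        elif lines[i].strip():
            blocks.append(('para', lines[i]))
            i += 1
        else:
            i += 1  # skip blank lines
    return blocks

print(parse_blocks(['~~~', '===']))    # [('code', ['==='])]
print(parse_blocks(['Title', '===']))  # [('header', 1, 'Title')]
```

The "===" line ends up inside the code block whether or not a closing fence follows, which is the "either way" above.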
Re (2): I believe the talk of precedence may be misleading here (I thought it would be useful heuristically). The basic principle of inline parsing is to go left to right, consuming inlines that match the specs. This resolves all of these cases. Perhaps the talk of precedence should be removed.
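As an illustration of the left-to-right principle for (2), here is a small Python sketch that handles only code spans and open tags. The TAG pattern is my own simplification (double-quoted attribute values only) and is much looser than the spec's definition of an HTML tag; scan_inlines is a hypothetical name.

```python
import re

# Assumed, simplified open-tag pattern: attribute values must be double-quoted.
TAG = re.compile(r'<[A-Za-z][A-Za-z0-9]*(?:\s+[A-Za-z_:][\w:.-]*(?:="[^"]*")?)*\s*/?>')
TICKS = re.compile(r'`+')

def scan_inlines(text):
    """Go left to right, consuming whichever inline matches at the current position."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '`':
            run = TICKS.match(text, i).group()
            start = i + len(run)
            # A code span closes at the next backtick run of the same length.
            close = next((m for m in TICKS.finditer(text, start)
                          if len(m.group()) == len(run)), None)
            if close:
                out.append(('code', text[start:close.start()]))
                i = close.end()
            else:
                out.append(('text', run))
                i = start
            continue
        if ch == '<':
            m = TAG.match(text, i)
            if m:
                out.append(('html', m.group()))
                i = m.end()
                continue
        # Otherwise consume literal text up to the next special character.
        j = i + 1
        while j < len(text) and text[j] not in '`<':
            j += 1
        out.append(('text', text[i:j]))
        i = j
    return out

print(scan_inlines('<a href="`">`'))  # [('html', '<a href="`">'), ('text', '`')]
print(scan_inlines('<a `href=""`>'))  # [('text', '<a '), ('code', 'href=""'), ('text', '>')]
print(scan_inlines('`<a href="">`'))  # [('code', '<a href="">')]
```

Whichever construct starts first and matches is consumed, so all three cases are settled without a precedence table.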
I am no stranger to formal specifications. I wrote what I think was the first PEG grammar for Markdown (peg-markdown, which came to be used as the basis for MultiMarkdown and several other implementations). PEG isn't a good fit, especially for block-level parsing. It almost works for inline-level parsing, but there are some constructs (like code spans) that can't be done in PEGs. It might be worth specifying inline parsing in a pseudo-PEG format to avoid worries like those you've expressed.