by tannhaeuser 1255 days ago
I really like the paper, though I'm not sure the world needs another Turing-complete document language however well-motivated ;)

As an SGML pedant, however, I can't resist commenting on the following:

> The second offspring of SGML is XML, specified by the World Wide Web Consortium (W3C) in 1998. It has a reduced feature set compared to SGML (for example, it forbids unclosed tags and concurrent markup). But it retains the most important aspect of SGML, one that HTML is lacking: The ability to define custom structural elements. This lets XML represent documents with much more semantic detail than HTML.

As the SGML vocabulary it was once envisioned to be, HTML itself doesn't need extensibility. When HTML is used as an SGML application, defining your own elements is as easy as declaring them in the "internal subset" or in a custom DTD. And assuming any well-formed element is accepted as per ISO 8879 Annex K's FEATURES IMPLYDEF ELEMENT, rather than undeclared elements being rejected, declarations are only necessary if you want to validate/infer custom content models, or use any of the other things markup declarations provide, such as custom SHORTREF syntax a la markdown.

Arguably, HTML5's "custom elements" do provide a facility to define new elements, if an incredibly lousy one; i.e. custom elements can't have content model restrictions (see above), can't be used with tag omission/inference (important for customized elements), aren't integrated with DOM parsing, and need JavaScript for declaration - the latter point making them completely pointless as a markup feature.


But HTML never was “an SGML application” in practice, and I highly doubt it was ever actually envisioned as that. There may have been some tools out there that processed HTML as SGML, but none of the ones I know of did (most notably browsers).

And in fact, in practice you could just use your own custom elements without worrying about validity and it’d mostly just work. This wasn’t even particularly rare. (There was the whole “CSS doesn’t work on them until you call document.createElement("…")” bug in IE, but that’s the only problem I can think of, and it was easily worked around.)
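To illustrate the "it mostly just works" point outside a browser, here is a minimal sketch using Python's standard `html.parser`. It is a lenient tokenizer, not a browser engine, and the `margin-note` element, the `TagCollector` class, and the `tokenize` helper are all made up for the example. The point is only that a non-validating parser reports an undeclared element like any other tag.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start tags, end tags, and non-blank text as simple tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


def tokenize(markup):
    """Feed markup to the collector and return the recorded events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# "margin-note" is a hypothetical custom element, declared nowhere.
SAMPLE = '<p>See <margin-note kind="aside">a custom element</margin-note>.</p>'
events = tokenize(SAMPLE)

for event in events:
    print(event)
```

No DTD or declaration is consulted at any point; whether that leniency is a feature or a bug is what the rest of this thread argues about.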

HTML the markup language was clearly intended as an SGML vocabulary - TBL himself said as much [1] and HTML also reused element names from the SGML spec/handbook as example/folklore vocabulary such as for paragraphs and headings.

What browsers made out of it isn't the matter here, but even if it were, the "practical, real-world HTML out there" argument is mostly used by an ad company/browser cartel to pull up the ladder, a situation made worse day in, day out by an atrocious and absurdly voluminous HTML spec (and by CSS, of course).

Even though Ian Hickson, of WHATWG, wanted to capture HTML as it was understood by browsers, he couldn't help but add elements of his own - such as "aside" for marking up ads, lol, plus the alien concept of sectioning elements that gave rise to the flawed "outline algorithm" and the misuse of heading elements (and, earlier, the failure to understand SGML's RANK feature), a problem that was only fixed last year [2] by an incompatible change to HTML invalidating documents that use hgroup as originally advised.

In practice, very few changes to the HTML syntax brought HTML outside SGML - for the most part, ad-hoc and basically unnecessary commenting rules for the script and style elements, meant to keep legacy browsers from rendering JavaScript and CSS, resp., when those were introduced.

[1]: http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html

[2]: https://github.com/w3c/htmlwg/issues/22

Look, I’ve got to say it. Your entire approach to SGML (in this and many other threads) simply doesn’t match reality now, and, as far as I can tell (though it was before my time), never matched reality in matters pertaining to HTML and XML.

You seem to always start with the assumption that SGML was (and perhaps is) an end goal. I deny this.

HTML was designed as an SGML vocabulary, but, where it mattered, never implemented as an SGML vocabulary. If Tim Berners-Lee ever even expected it to be treated as SGML very much, I suspect he hadn’t thought things through well enough (though that could also just be hindsight bias on my part).

There has never been any particular virtue in HTML being an SGML vocabulary. No one that mattered (which mainly means browsers) cared about SGML, then or now, and no web developers or end users care about SGML, so being SGML is just needless complication and potential for confusion (due to that implying different behaviour from reality). SGML is a hideous, complex beast that no one wants to work with, and which almost everyone that has heard of it is glad is dead.

Yes, SGML had some nice ideas. Yes, we keep on reinventing parts of it. Yes, a variant of Greenspun’s tenth rule applies. But SGML was just too flexible/generic, large and ugly. It doesn’t actually solve things. And the current HTML parser is the best thing since sliced bread and my favourite popular file type spec by a large margin despite its size, because it’s clear, unambiguous, and implementable.

> There has never been any particular virtue in HTML being an SGML vocabulary. No one that mattered (which mainly means browsers) cared about SGML

You can care about browsers; I care about documents, and that they can still be read and understood in a couple of decades. Preferably without kissing the ring of an ad company.

Defining your own vocabularies and SGML is also directly mentioned in the paper being discussed. SGML lets you define your own custom language, and a mapping to HTML as the output/rendering language, without further tools.

> SGML is a hideous, complex beast that no one wants to work with

As opposed to what? The web platform specs covering all of HTML, CSS, and JS, at roughly a thousand times the size of the SGML spec? Have you actually studied SGML or implemented a parser for a markup language, or are you repeating what you've heard elsewhere?

> And the current HTML parser is the best thing since sliced bread and my favourite popular file type spec by a large margin despite its size, because it’s clear, unambiguous, and implementable.

Which version of WHATWG HTML5? Oh, WHATWG don't bother versioning their phone-book sized specs. And parsing breaks all the time; e.g. current head doesn't contain the param element anymore (as content of the object element still in the spec), which however requires that no end tag be specified; hence a parser for current WHATWG HTML will fail hard in the presence of param elements (similar story with legacy elements such as keygen). Then there are new "boolean attributes" being introduced all the time, requiring special rules/markup declarations ...
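A toy sketch of the failure mode described here, for readers who want to see it concretely. It assumes a parser that derives its set of end-tag-less ("void") elements from whatever element list it is given; it is not the WHATWG parsing algorithm, and the `NestingTracer` class, the `trace_nesting` helper, and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class NestingTracer(HTMLParser):
    """Record (depth, tag) for each start tag, given a set of void elements."""

    def __init__(self, void_elements):
        super().__init__()
        self.void_elements = set(void_elements)
        self.open_elements = []  # stack of currently open element names
        self.trace = []          # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.trace.append((len(self.open_elements), tag))
        # Void elements never get an end tag, so they are not pushed.
        if tag not in self.void_elements:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the nearest matching open element;
        # stray end tags with no matching open element are ignored.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def trace_nesting(markup, void_elements):
    """Return the (depth, tag) trace of markup under a given void-element set."""
    tracer = NestingTracer(void_elements)
    tracer.feed(markup)
    tracer.close()
    return tracer.trace


SAMPLE = (
    '<object data="movie.swf">'
    '<param name="quality" value="high">'
    '<p>Fallback</p>'
    '</object>'
)

# Element list that still knows param takes no end tag.
with_param = trace_nesting(SAMPLE, {"param", "br", "img"})
# Element list from which param has been dropped.
without_param = trace_nesting(SAMPLE, {"br", "img"})

print(with_param)
print(without_param)
```

With param in the void set, the p element is a sibling of param inside object; with param dropped, the same markup nests p inside the never-closed param. Real parsers may well special-case such legacy elements, so treat this only as a picture of what is at stake.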

With respect, the argument isn't particularly relevant anyway as those specs aren't aimed at folks having difficulties following a formal language spec/grammar but need procedural step-by-step instructions instead.

> I really like the paper, though I'm not sure the world needs another Turing-complete document language however well-motivated ;)

What would you recommend instead? From my experience of the beta, it is vastly nicer to use than LaTeX, and I'm not really aware of any other competition.