| HN Mirror


by niconii 1552 days ago

Just to add to this, HTML parsers will attempt to make sense of any random line noise you give them and turn it into a DOM. When I say something is "invalid" HTML, what I mean is that it's not allowed by the spec and will result in an error if you run it through a validator (which you should do!).

For example, try running the following document through the W3C's HTML validator[1]:

  <!DOCTYPE html>
  <html lang=en>
  <title>Test Document</title>

  <p>
    <div></div>
  </p>

The HTML spec contains a list of all possible parse errors[2].

[1] https://validator.w3.org/nu/#textarea

[2] https://html.spec.whatwg.org/multipage/parsing.html#parse-er...


jraph 1552 days ago

It is allowed by the spec afaik (the spec tells browsers precisely how to interpret this, so everything is fully specified within the spec's boundaries, and so it's not "out of spec").

Of course, that screams "MISTAKE", and a validator should warn you about it. It's like a linter that flags missing extra parentheses around an assignment in an if condition in a C-like language: you are allowed to omit the parentheses, but it is recommended to add them.

And of course, that makes "Valid HTML" (almost?) redundant. (There are probably "vocabulary" errors that are still possible, like a missing src attribute on an img or a missing title tag in head. Don't take my word for this, though.)

div in p is not invalid, it's outright impossible to obtain from HTML parsing.

You can obtain this by doing this in JavaScript:

    const p = document.createElement("p");
    p.appendChild(document.createElement("div"));
    // Append last: the DOM API does not apply the parser's rule that closes <p> before <div>.
    document.body.appendChild(p);

Or by parsing as XHTML:

    data:application/xhtml+xml;charset=utf-8,<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Hello</title></head><body><p><div></div></p></body></html>

You get:

    document.body.innerHTML

    <p xmlns="http://www.w3.org/1999/xhtml"><div/></p>

Which, I realize, is actually a bit scary: I go out of my way to write XHTML in the hope that any error will be caught, but parsing as text/html actually produces a valid DOM where parsing as XHTML won't necessarily.
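
To make the "impossible to obtain from HTML parsing" point concrete, here is a rough toy model of what the text/html tree builder does with p and div. It is only a sketch: the names toyParse and serialize are made up for illustration, it knows about nothing except these two elements, and it approximates the spec's "p element in button scope" check with a plain search of the open-element stack.

```javascript
// Toy model of the text/html tree builder, limited to <p> and <div>.
// Hypothetical helper names; the real algorithm in the spec is far larger.
function toyParse(tokens) {
  const root = { name: "#root", children: [] };
  const stack = [root]; // stack of open elements
  const errors = [];

  const current = () => stack[stack.length - 1];
  const open = (name) => {
    const node = { name, children: [] };
    current().children.push(node);
    stack.push(node);
  };
  // Pop elements until one named `name` has been popped (never pops the root).
  const closeThrough = (name) => {
    while (stack.length > 1 && stack.pop().name !== name) {}
  };
  const hasOpen = (name) => stack.some((node) => node.name === name);

  for (const token of tokens) {
    const isEnd = token.startsWith("/");
    const name = isEnd ? token.slice(1) : token;
    if (!isEnd) {
      // A <p> or <div> start tag first closes any <p> that is still open.
      if (hasOpen("p")) closeThrough("p");
      open(name);
    } else if (hasOpen(name)) {
      closeThrough(name);
    } else if (name === "p") {
      // </p> with no open <p>: a parse error, handled as an empty <p></p>.
      errors.push("stray </p>");
      open("p");
      closeThrough("p");
    } else {
      errors.push(`stray </${name}>`); // ignored in this toy
    }
  }
  return { root, errors };
}

function serialize(node) {
  const inner = node.children.map(serialize).join("");
  return node.name === "#root" ? inner : `<${node.name}>${inner}</${node.name}>`;
}

const { root, errors } = toyParse(["p", "div", "/div", "/p"]);
console.log(serialize(root)); // <p></p><div></div><p></p>
console.log(errors);          // [ 'stray </p>' ]
```

The div start tag closes the open p before the div is inserted, so the div ends up as a sibling, and the leftover </p> is the one parse error.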


niconii 1552 days ago

This is not quite right. The HTML spec specifies not only what browsers/user agents are allowed to do, but also what document authors are allowed to do.

While the HTML parser does handle errors, to conform to the spec, document authors must not make these errors.

Here is an excerpt from the spec[1]:

> As described in the conformance requirements section below, this specification describes conformance criteria for a variety of conformance classes. In particular, there are conformance requirements that apply to producers, for example authors and the documents they create, and there are conformance requirements that apply to consumers, for example web browsers. They can be distinguished by what they are requiring: a requirement on a producer states what is allowed, while a requirement on a consumer states how software is to act.

Furthermore, a user agent is not required to correct errors, and can simply halt at the first error[2]:

> The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

[1] https://html.spec.whatwg.org/multipage/introduction.html#how...

[2] https://html.spec.whatwg.org/multipage/parsing.html#parse-er...
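
The two consumer behaviours described in that second excerpt, recovering or aborting at the first parse error, can be sketched with a toy token consumer. This is an illustration only: consume and abortOnError are hypothetical names, the only error it models is a stray end tag, and its "recovery" just records the error and moves on rather than applying the spec's actual processing rules.

```javascript
// Toy consumer: either recovers from a parse error or aborts at the first one.
// Hypothetical names; only a stray end tag is modelled as an error.
function consume(tokens, { abortOnError = false } = {}) {
  let open = [];      // names of currently open elements
  const errors = [];
  let consumed = 0;   // tokens fully processed so far

  for (const token of tokens) {
    if (token.startsWith("/")) {
      const name = token.slice(1);
      const index = open.lastIndexOf(name);
      if (index === -1) {
        errors.push(`unexpected </${name}> at token ${consumed}`);
        if (abortOnError) return { errors, consumed, aborted: true };
        // Recovering: record the error and carry on with the next token.
      } else {
        open = open.slice(0, index); // close it and anything nested inside
      }
    } else {
      // A <div> start tag closes an open <p> first, as in text/html parsing.
      const p = open.lastIndexOf("p");
      if (token === "div" && p !== -1) open = open.slice(0, p);
      open.push(token);
    }
    consumed += 1;
  }
  return { errors, consumed, aborted: false };
}

const tokens = ["p", "div", "/div", "/p", "div", "/div"];
console.log(consume(tokens));                         // recovers, consumes all 6 tokens
console.log(consume(tokens, { abortOnError: true })); // stops at the stray </p>
```

Both modes report the same single error; the difference is only whether the rest of the document gets processed.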


jraph 1552 days ago

I stand corrected! Thank you. I should have read the spec before posting.
