by rodarima 36 days ago
Author here. I agree that you cannot go from HTML to XHTML because users and UA devs will always go towards "it mostly works".

However, it is not clear to me that this cannot be done from the start, so that the expectations are right from the beginning. For example, I don't see the same problem in other formats like JPEG or PNG where you expect the image to work perfectly or fail with a decoding error.

Other than implementing it and seeing how it goes, can you propose a feasible experiment to see how a new strict spec will measurably fail?


browsers will display invalid/corrupt images (best effort)

tried it right now - took a PNG and a JPEG, opened them in a text editor, literally deleted the second half of the file, saved, and dragged them into both Firefox and Chrome - they are displayed instead of erroring out.
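For comparison, a strict decoder could tell that such a file is cut off. The sketch below is only an illustration and says nothing about what Firefox or Chrome do internally: it walks the PNG chunk list (4-byte big-endian length, 4-byte type, data, 4-byte CRC) and reports whether the final IEND chunk is present. The name `classifyPng` is made up for this example, and CRCs are deliberately not verified.

```javascript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Hypothetical helper: returns "complete", "truncated", or "not-png".
// Takes a Uint8Array; does not check chunk CRCs or chunk ordering.
function classifyPng(bytes) {
  if (bytes.length < PNG_SIGNATURE.length) return "not-png";
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) return "not-png";
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = PNG_SIGNATURE.length;
  // Each chunk needs at least 8 bytes for its length and type fields.
  while (offset + 8 <= bytes.length) {
    const length = view.getUint32(offset); // DataView reads big-endian by default
    const type = String.fromCharCode(
      bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7]
    );
    const end = offset + 8 + length + 4; // header + data + CRC
    if (end > bytes.length) return "truncated"; // chunk runs past end of file
    if (type === "IEND") return "complete";
    offset = end;
  }
  return "truncated"; // ran out of bytes before seeing IEND
}
```

A file with its second half deleted would come back as "truncated" here, so rendering it anyway is a choice the browser makes, not something the format forces.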

there is a classic article on why a minimal version of the web with features removed will fail - you removed 80% of the features that YOU think are not important. that's a classic fatal mistake

search the web for different proposals for a minimal web and you will understand - they will have removed some feature they think is bloat but which you kept in your proposal because you consider it critical. which is why you created a new proposal - their minimal proposal is not the right one for you

https://www.joelonsoftware.com/2001/03/23/strategy-letter-iv...

> they are displayed instead of erroring out.

I think what is lost on many people, ironically even the ones who want to retvrn the web to its former glory, is that the browser tries to display broken, half transmitted content because it happened so frequently due to circumstances completely out of the website operator's or the user's control. And in most cases showing a half transmitted web page with half of the closing tags missing is almost certainly better than just outright refusing to show anything.
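To make "half of the closing tags missing" concrete, here is a toy sketch that lists which elements are still open when a page is cut off. It is not the HTML parsing algorithm that browsers implement (that one has many special cases); `missingClosingTags` and its regex-based scan are assumptions made for illustration only.

```javascript
// Elements that never take a closing tag in HTML.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: returns the names of elements still open at the end
// of the input, innermost first. Only complete tags are recognized, and
// comments, scripts, and implied end tags are ignored.
function missingClosingTags(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  const open = [];
  for (const match of html.matchAll(tagPattern)) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    if (VOID_ELEMENTS.has(name)) continue;
    if (!isClosing) {
      open.push(name);
      continue;
    }
    // Close the nearest matching element and anything left open inside it.
    const index = open.lastIndexOf(name);
    if (index !== -1) open.length = index;
  }
  return open.reverse();
}
```

For a page cut at `<html><body><p>Hello <b>wor`, this reports `b`, `p`, `body`, `html`, which is roughly the set of elements a lenient renderer has to close on the author's behalf in order to show anything.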

Couldn't that be a source for vulnerabilities?
Missing closing tags in HTML, no.
But I could imagine a page where cutting the HTML off would make the answer yes (not exact JS):

  <script>
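    // setTimeout takes the callback first, then the delay in milliseconds
    setTimeout(() => {
      safeEval(userInput); // userInput stands in for some user input
    }, 10000);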
  </script>
  <script>
    window.safeEval = code => eval(code);
  </script>

  <!-- cut the page here -->
  <!-- the previous and next script tags around this comment could be combined into one and cut in the middle, if the browser auto-closes the tag and still treats the truncated script as valid -->

  <script>
    // safety fixed!
    const notTooSafe = window.safeEval;

    window.safeEval = code => {
      // reject anything that contains a non-digit character
      if (/\D/.test(code)) throw new Error("unsafe");
      return notTooSafe(code);
    };
  </script>
Parent poster was talking about the latter half of a page being missing, rather than a chunk out of the middle, I believe.
> I agree that you cannot go from HTML to XHTML because users and UA devs will always go towards "it mostly works".

That... is not how anything happened.

> I don't see the same problem in other formats like JPEG or PNG where you expect the image to work perfectly or fail with a decoding error.

Browsers absolutely decode as much as they can, and if the file is corrupted halfway through you generally get garbling, not the entire image being replaced by "fuck off". The only case where that is so is if the browser can't parse anything at all, or can't retrieve the file.

> Other than implementing it and seeing how it goes, can you propose a feasible experiment to see how a new strict spec will measurably fail?

We already did that and saw where it went.

> Browsers absolutely decode as much as they can, and if the file is corrupted halfway through you generally get garbling, not the entire image being replaced by "fuck off". The only case where that is so is if the browser can't parse anything at all, or can't retrieve the file.

What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say. Like this one:

https://html.spec.whatwg.org/multipage/parsing.html#adoption...

Perhaps a better example is a C program being compiled into an executable. You don't expect the compiler to guess what you meant while parsing.

The current expectation is that a web browser must load any broken HTML and still display what it can, and it is this expectation that I would like to change.

I don't propose humans to write this format directly (although it should be human readable), but compile it from something that is easy to write, like Markdown or a similar language. The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Having a context-free grammar allows simple and fast parsing tools that can process your document, in a similar way that you can query or manipulate a JSON file with tools like jq because the grammar is simple and strict.
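As a small illustration of the "parse or fail" behaviour being described, JSON already works this way in practice: `JSON.parse` either returns a value or throws a `SyntaxError`, with no attempt to repair the input. The wrapper name `parseStrict` below is made up for this sketch.

```javascript
// Hypothetical wrapper around JSON.parse: either the whole document is
// valid and a value comes back, or the input is rejected with a message.
function parseStrict(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    if (!(error instanceof SyntaxError)) throw error; // unrelated failure
    return { ok: false, message: error.message };
  }
}

// A document missing its closing brace is rejected outright, not repaired.
const result = parseStrict('{"title": "post"');
console.log(result.ok ? "parsed" : "rejected: " + result.message);
```

The exact wording of the error message depends on the JavaScript engine, so tools should rely on the rejection itself, not on the message text.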

On the contrary, image decoders all run complex processes that try and guess what to do in erroneous cases. I used to maintain Chrome's image decoders, and every single image format has "what the spec says" and then "what people actually do in practice"; you must handle the latter, and it is often very difficult to figure out how to do so. For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".
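A minimal sketch of the kind of scan described for BMPs, to show why it needs the fully decoded image. This is not Chrome's code: the function name, the 4-bytes-per-pixel layout with the fourth byte as the candidate alpha, and the exact rule (alpha counts only if the bytes are not all identical) are assumptions for illustration.

```javascript
// Hypothetical heuristic: decide whether the fourth byte of each pixel
// carries real alpha or is just padding.
// pixelBytes: array or Uint8Array of decoded pixels, 4 bytes each.
function alphaLooksMeaningful(pixelBytes) {
  if (pixelBytes.length % 4 !== 0) {
    throw new RangeError("expected 4 bytes per pixel");
  }
  if (pixelBytes.length === 0) return false;
  const firstAlpha = pixelBytes[3];
  // Every pixel has to be visited before "no alpha" can be concluded.
  for (let i = 7; i < pixelBytes.length; i += 4) {
    if (pixelBytes[i] !== firstAlpha) return true;
  }
  // All alpha bytes are identical ("all 00" or "all FF"): treat as opaque.
  return false;
}
```

The answer is only known after the last pixel has been examined, which is why this kind of guess cannot be made from the file header alone.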

I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

Hyrum's Law is a real jerk sometimes.

> For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".

This causes a situation in which a page that renders the image in Chrome doesn't work in Firefox or other browsers that don't implement the same non-standard correcting algorithm. Worse, the user doesn't have a way to know from Chrome that the image is broken and it will likely continue to be broken forever.

Generalizing this approach, you end up having to test your site in every major browser to see whether you made a mistake that is only revealed in a browser that lacks that recovery mechanism.

> I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

I don't think this is addressing my point. When you write a C program, you expect the compiler to either recognize the program from the C grammar, or reject it because it is not correct (hence the concept of "error"). Only then does it run whatever guessing algorithm it has to report to you what may be wrong.

The programmer's expectation is that the program must strictly conform to the C grammar, and that errors get corrected by the programmer. The compiler does not silently produce a half-reconstructed program based on what it assumes you meant to say.

> What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say.

So what you meant is neither what you wrote, nor what you advocate for?

because in case you have forgotten here is what you advocate for:

> Pages that don't conform with the specification won't be rendered.

That is not how image rendering in browsers works. That is how XHTML does not work.

> Perhaps a better example is a C program being compiled into an executable.

It's not a better example, because it's a completely different and unrelated use case: C programs are usually not dynamically generated, and even when they are the person who compiles the code is usually either the person who wrote it or a person who has ways to fix it or report errors.

Not so when trying to read a random page on the web.

> I don't propose humans to write this format directly (although it should be human readable)

Approximately nobody wrote XHTML by hand, and that didn't save it.

This is also a nonsensical constraint-set on its face, there is no point to a human readable format which is not human writeable.

> The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Ah, an open and non-monopolizable format which can only be written via an official toolchain.

> Having a context-free grammar allows simple and fast parsing tools that can process your document in a similar way that you can query or manipulate a JSON^H^H^H^HXML file with tools like jq^H^Hxmlstarlet because the grammar is simple and strict.

None of which seems of any use to a format which pretends to human production and consumption. JSON is an interchange format between machines.

> Pages that don't conform with the specification won't be rendered.

I stand by what I wrote here. They will fail with an error indicating where the mistake is, so that you (or, more likely, the tool that produced the page) can correct it.

>> The objective is to enforce tools that make the transformation to produce a strictly conformant document.

> Ah, an open and non-monopolizable format which can only be written via an official toolchain.

??

The objective is that when you make a tool like markdown-to-foo, the output follows the spec. There is no mention of any "official toolchain".

> xmlstarlet

XML is strict. Try to find the same tool for HTML5, especially for transformations.

> JSON is an interchange format between machines.

That is pretty much what the specification would try to cover.

Instead of C, maybe Lisp, or Forth without messing up the stack.
> That... is not how anything happened.

What the heck are you talking about? User agent devs and users did indeed always go toward it mostly works.

People didn't go towards "it mostly works", people go towards "it works at all". A lot of people tried to use XHTML, and it didn't work: broken content was pervasive, and the experience when facing broken content was irredeemable.
What was the exact nature of how devs found themselves unable to emit valid XML in all scenarios? What kind of bugs did they run into?
as a person who just wants to publish a simple blog and informational articles, i would happily use this subset if it were still compatible with popular browsers. i've been praying for an effort like this to organize and am grateful that you are taking a stab at it. i would use dillo as my main testing browser if it was the browser that honored the subset spec most accurately and guaranteed compatibility with Chrome and Firefox.