| HN Mirror


by TazeTSchnitzel 40 days ago

> The specification must contain a non-ambiguous formal grammar that can be parsed easily. A page can then be tested against the standard and reject or accept as compliant. Pages that don't conform with the specification won't be rendered. It is explicitly forbidden for clients to accept any page that doesn't conform with the specification.

This is what XHTML was, and it was a complete disaster. There's a reason almost nobody serves XHTML with the application/xhtml+xml MIME type, and that reason is that getting a “parser error” (this is what browsers still do! try it!) is always worse than getting a page that 99% works.[0] I strongly believe that rejecting the robustness principle is a fatal mistake for a web-replacement project. The fact that horribly broken old sites can stay online and stay readable is a huge part of the web's value. Without that, it's not really “the web”, spiritually or otherwise.

[0] It's particularly “cool” how they simply do not work in the Internet Archive's Wayback machine. The page can be retrieved, but nobody can read it.

9 comments

TFNA 40 days ago

XHTML failed in an era when writers (even normies) were writing some HTML of their own and they couldn't be trusted to close their tags properly. XHTML also assumed writers would be personally invested in semantic markup like distinguishing e.g. the italics of book titles from the italics of emphasis.

Today, when writers are using visual editors (or Markdown), few are writing their own HTML any more. A web standard requiring compliance would work differently today.

PaulHoule 40 days ago

Markdown sux and so do visual editors. I think visual editors were just invented to make it so cut-and-paste never quite works right. There's been some conceptual problem with the whole idea ever since MS Word and the industry has never dealt with it.

hdjrudni 38 days ago

I think it could conceivably work. I vaguely remember WinForms being decent. You have to restrict what the user can do a bit. Have things snap into allowed positions.

CM30 39 days ago

The issue is that even if you're not writing your own code, you're reliant on your CMS or framework, its plugins and imports, any advertising networks you use, etc not to break your site. There are already issues where those things cause server errors or end up being incompatible with other additions when upgraded, but giving them another way to break your entire site just makes such things even more of a hassle.

intrasight 40 days ago

> XHTML failed in an era when writers (even normies) were writing some HTML of their own

I'd say it was a minority of writers that were handcrafting XHTML. And everyone, whether handcrafting or using tools, could validate their compliance using a browser, which made it very easy to adjust your tools or your handcrafted code. We are now in a situation where there is no schema for HTML.

I, for one, am very much in favor of forking the web with a document format with a schema. It really seems like a small and simple change to me.

TFNA 40 days ago

Note that when I say "writing their own HTML", I don't mean handcrafting a whole webpage. I mean that people were writing i or b tags in their Wordpress editors or in online comment boxes, because back then such text fields did not have visual editors and would accept raw tags. Under XHTML, if the writer did not close tags properly, such input would have broken the whole page, so obviously back then such a standard was DOA.
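
To make that failure mode concrete, here is a minimal Python sketch, assuming a hypothetical comment with an unclosed `<i>` tag dropped into an otherwise valid page. It uses only the standard library: `xml.etree.ElementTree` stands in for a strict XML parser and `html.parser.HTMLParser` for a tolerant one (neither is what a browser actually runs).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical user comment with an unclosed <i> tag.
comment = "I <i>really like this post"
page = f"<html><body><p>{comment}</p></body></html>"


def strict_parse_ok(markup: str) -> bool:
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Tolerant parser: collects text and never rejects the input."""

    def __init__(self) -> None:
        super().__init__()
        self.text: list[str] = []

    def handle_data(self, data: str) -> None:
        self.text.append(data)


def lenient_text(markup: str) -> str:
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.text)


print(strict_parse_ok(page))  # the whole page is rejected
print(lenient_text(page))     # the text is still recovered
```

One stray tag in one comment is enough for the strict path to reject the entire document, while the tolerant path still yields the readable text.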

singpolyma3 40 days ago

Those cases were easy to fix by using eg htmltidy on the UGC.

Honestly I don't think it was killed by one thing, or by anything. Just no platform really cared and it wasn't a win for anyone and occasionally a loss.

maxerickson 40 days ago

No scripting is a tell, it's about wanting other people to accommodate their concerns about running a complex browser, not about solving a real problem.

If it did somehow happen that a good deal of interesting content was published using the standard, the most popular client would probably be nonconforming, ignoring the rule to not render ambiguous content.

krapp 40 days ago

Every modern alternative web protocol is about accommodating the author's concerns and pet peeves about the modern web (and usually gatekeeping it from capitalists and normies.)

Protocols used to be limited by technology, now they're defined by ideology.

fooqux 40 days ago

Agreed. There may be some situations where I may want to ensure 100% correctness. I'm thinking life or death scenarios, (which if so, maybe should use a different protocol). However, checking the sports score or looking at cat memes isn't that.

pibaker 40 days ago

There are also life and death scenarios where being able to show a broken page saves lives. Imagine there is a storm coming in your area and the government website listing addresses of emergency shelters is barely loading because it is overloaded or because your phone signal is bad. Being able to just load and show half of the page's html content is still better than nothing.

gershy 39 days ago

I think anyone of sufficient intelligence can devise an argument to frame anything in life-and-death terms.

Doesn't error tolerance promote developer habits that could lead to complete downtime? During which lives could be lost? Don't our current standards result in more churn of physical hardware? Which winds up in garbage dumps in poor countries? On fire, with toxic fumes? Being picked over by labourers, breathing it in? And losing their lives early?

culi 40 days ago

When you visit an HTTP site, browsers give you a warning screen alongside an option to "open anyways".

We could do the same with sites that are not 100% correct. Users are already used to having to click "Open anyways" for older, non-HTTPS sites anyways.

Bratmon 40 days ago

Browsers briefly tried that in the early 00s. It turns out that, from a user perspective, that's an incredibly stupid question- the user has no way of knowing how well the page works until they click "yes"!

culi 38 days ago

The same can be said for the security of an HTTP site.

nofriend 40 days ago

The reason is that clients, even under XHTML, expect to be able to build webpages via templating. You need to reject that assumption and demand that servers build pages from an AST so that the backend guarantees that the page parses. It isn't hard to do, it's just that XHTML never got far enough to try it.
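
A rough Python sketch of what "build pages from an AST" could look like, assuming a hypothetical `render_comment_page` helper and using the standard library's `xml.etree.ElementTree` as the tree. This is an illustration of the idea, not how any particular framework does it.

```python
import xml.etree.ElementTree as ET


def render_comment_page(title: str, comments: list[str]) -> str:
    """Build the page as a tree and serialize it, instead of gluing strings."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    for comment in comments:
        # User text is stored as a text node, so any markup in it is escaped
        # on output and cannot unbalance the document.
        ET.SubElement(body, "p").text = comment
    return ET.tostring(html, encoding="unicode")


page = render_comment_page("Thread", ["fine", "I <i>forgot to close this"])
ET.fromstring(page)  # re-parsing succeeds: the output is well-formed
print(page)
```

Because the serializer owns every tag, the output parses by construction. The trade-off in this sketch is that user markup is escaped rather than honored; allowing a safe subset of tags would mean parsing the input into nodes first.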

singpolyma3 40 days ago

To be fair, HTML5 also has a defined parsing algorithm. It just happens to always work on any input to produce a webpage

jerf 40 days ago

Yes, this is what you'd want. It doesn't have to be as complicated as the HTML5 algorithm either. That's complicated because it was a harmonization of at least 3 browsers' multi-decade heuristics and untold terabytes of existing HTML practice. An algorithm unconcerned with backwards compatibility could be much simpler, but still clearly define error behavior much easier to use than "scream and die".

And it's still unambiguous. You can cringe at what some people do, but it would be strictly a taste issue rather than a technical one, as the parse would still be unambiguous. And if you think you can fix taste issues with technical specification, well, you've already lost anyhow.

stavros 40 days ago

I think the GP has an issue not with the specification part, but with the part where it's forbidden for clients to render a noncompliant page.

tardedmeme 40 days ago

It's not forbidden. They just don't render certain noncompliant pages. Namely the ones with gross syntax errors.

Why are we okay with formats like PDF that have similarly catastrophic error handling?

zbentley 40 days ago

I mean, we aren’t ok with that for PDF. That’s why PDF renderers have incredibly baroque rules for parsing weirdly or brokenly formatted documents, and why many PDF documents fall back to embedding images or absolute-positioned pixel-like layouts for compatibility purposes.

stavros 40 days ago

I mean, the linked page and the comment above say it is:

> It is explicitly forbidden for clients to accept any page that doesn't conform with the specification. This prevents the standardized diabolic rules that one must implement in order to correct a

masklinn 40 days ago

I don't get this reply. GP didn't say anything about parsing algorithms, they said (correct) things about hard errors on the web.

112233 40 days ago

why for? the reply is about factual historical experience with webpage hard errors.

Would you like to have a law that forbids you, under penalty of fine, to read any book you buy or borrow that is lacking or has damaged pages?

jazzypants 40 days ago

I thought they were just bolstering the refutation of TFA's assertion that XHTML is strictly better because of its parsing algorithm.

rodarima 40 days ago

Author here. I agree that you cannot go from HTML to XHTML because users and UA devs will always go towards "it mostly works".

However, I'm not convinced that this cannot be done from the start, so that the expectations are right from the beginning. For example, I don't see the same problem in other formats like JPEG or PNG where you expect the image to work perfectly or fail with a decoding error.

Other than implementing it and seeing how it goes, can you propose a feasible experiment to see how a new strict spec will measurably fail?

htmlenjoyye 40 days ago

browsers will display invalid/corrupt images (best effort)

tried it right now - took a PNG and a JPEG, opened them in a text editor, literally deleted the second half of the file, saved, and dragged them into both Firefox and Chrome - they are displayed instead of erroring out.
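
A small Python sketch of why a half-transferred image can still be partly decoded: PNG pixel data is stored as a zlib (DEFLATE) stream, and such a stream can be decoded incrementally. This uses only the standard `zlib` module on made-up data; it is not a PNG decoder and says nothing about how Firefox or Chrome implement it.

```python
import zlib

# Made-up payload standing in for image rows.
original = b"".join(b"row %d\n" % i for i in range(2000))
compressed = zlib.compress(original)

# Simulate a transfer that stopped halfway.
truncated = compressed[: len(compressed) // 2]

decoder = zlib.decompressobj()
recovered = decoder.decompress(truncated)

print(len(recovered), "of", len(original), "bytes recovered")
print(original.startswith(recovered))  # the recovered part is a correct prefix
print(decoder.eof)                     # False: the stream never finished
```

The streaming decoder hands back whatever it could decode, which is the "best effort" behavior. The one-shot `zlib.decompress` call would instead raise an error on the same truncated input, which is the "scream and die" behavior.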

there is a classic article why a minimal version of the web with features removed will fail - you removed 80% of the features that YOU think are not important. thats a classic fatal mistake

search the web for different proposals for a minimal web and you will understand - they will have removed some feature they think is bloat but which you kept in your proposal because you consider it critical. which is why you created a new proposal - their minimal proposal is not the right one for you

https://www.joelonsoftware.com/2001/03/23/strategy-letter-iv...

pibaker 40 days ago

> they are displayed instead of erroring out.

I think what is lost on many people, ironically even the ones who want to retvrn the web to its former glory, is that the browser tries to display broken, half transmitted content because it happened so frequently due to circumstances completely out of the website operator or the user's control. And in most cases showing a half transmitted web page with half of the closing tags missing is almost certainly better than just outright refusing to show anything.

lostmsu 40 days ago

Couldn't that be a source for vulnerabilities?

ipaddr 40 days ago

Missing closing tags in html no.

lostmsu 39 days ago

I could imagine a page where cutting HTML would cause it to be a yes (not exact JS).

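  <script>
    // Fires after 10 seconds, by which time the hardened safeEval should be in place.
    setTimeout(() => {
      safeEval(userInput); // userInput stands for <some user input>
    }, 10000);
  </script>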
  <script>
    window.safeEval = code => eval(code);
  </script>

  <!-- cut the page here -->
  <!-- the prev and next tags around this comment could be combined in one and cut in the middle if the browser autocloses them and treats as valid script after -->

  <script>
    // safety fixed!
    const notTooSafe = window.safeEval;

    window.safeEval = code => {
      // Reject anything that contains a non-digit character.
      if (/\D/.test(code)) throw new Error("unsafe");
      return notTooSafe(code);
    };
  </script>

masklinn 40 days ago

> I agree that you cannot go from HTML to XHTML because users and UA devs will always go towards "it mostly works".

That... is not how anything happened.

> I don't see the same problem in other formats like JPEG or PNG where you expect the image to work perfectly or fail with a decoding error.

Browsers absolutely decode as much as they can, and if the file is corrupted halfway through you generally get garbling, not the entire image being replaced by "fuck off". The only case where that is so is if the browser can't parse anything at all, or can't retrieve the file.

> Other than implementing it and seeing how it goes, can you propose a feasible experiment to see how a new strict spec will measurably fail?

We already did that and saw where it went.

rodarima 40 days ago

> Browsers absolutely decode as much as they can, and if the file is corrupted halfway through you generally get garbling, not the entire image being replaced by "fuck off". The only case where that is so is if the browser can't parse anything at all, or can't retrieve the file.

What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say. Like this one:

https://html.spec.whatwg.org/multipage/parsing.html#adoption...

Perhaps a better example is a C program being compiled into an executable. You don't expect the compiler to guess what you meant while parsing.

The current expectation is that a web browser must load any broken HTML and still display what it can, and it is this expectation that I would like to change.

I don't propose humans to write this format directly (although it should be human readable), but compile it from something that is easy to write, like Markdown or a similar language. The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Having a context-free grammar allows simple and fast parsing tools that can process your document, in a similar way that you can query or manipulate a JSON file with tools like jq because the grammar is simple and strict.
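
As a small illustration of the JSON comparison, here is a Python sketch using the standard `json` module; the two sample documents and the `try_parse` helper are made up for the example.

```python
import json

valid = '{"title": "Post", "tags": ["web", "xhtml"]}'
invalid = '{"title": "Post", "tags": ["web", "xhtml",]}'  # trailing comma


def try_parse(text: str):
    """Return the parsed document, or None if it violates the grammar."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


doc = try_parse(valid)
print(doc["tags"][0])      # a simple query, in the spirit of jq '.tags[0]'
print(try_parse(invalid))  # rejected outright, no guessing
```

The parser either returns a tree that every conforming tool agrees on or reports an error; there is no recovery step that could differ between implementations.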

pkasting 40 days ago

On the contrary, image decoders all run complex processes that try and guess what to do in erroneous cases. I used to maintain Chrome's image decoders, and every single image format has "what the spec says" and then "what people actually do in practice"; you must handle the latter, and it is often very difficult to figure out how to do so. For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".
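
A rough Python sketch of the kind of scan described above, assuming 32-bit pixels laid out as B, G, R, A bytes. The function name and the exact rule are illustrative guesses, not Chrome's actual decoder logic.

```python
def alpha_is_meaningful(bgra: bytes) -> bool:
    """Guess whether the 4th byte of each 32-bit pixel is real alpha.

    If every pixel carries the same value there (e.g. all 0x00 or all 0xFF),
    assume the author meant "no alpha" and the byte is just padding.
    """
    if len(bgra) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    alphas = bgra[3::4]  # every 4th byte, starting at the alpha position
    return len(set(alphas)) > 1


opaque_padding = bytes([10, 20, 30, 0x00] * 4)  # all alpha bytes are 00
real_alpha = bytes([10, 20, 30, 0x00, 10, 20, 30, 0x80])

print(alpha_is_meaningful(opaque_padding))  # False
print(alpha_is_meaningful(real_alpha))      # True
```

Note that the answer is only known after every pixel has been read, which is why the whole image may have to be decoded before the format question is settled.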

I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

Hyrum's Law is a real jerk sometimes.

rodarima 39 days ago

> For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".

This causes a situation in which a page that renders the image in Chrome doesn't work in Firefox or other browsers that don't implement the same non-standard correcting algorithm. Worse, the user doesn't have a way to know from Chrome that the image is broken and it will likely continue to be broken forever.

Generalizing this approach, you end up having to test your site in every major browser to see if you made a mistake that is only revealed in the browser which lacks that recovery mechanism.

> I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

I don't think this is addressing my point. When you write a C program, you expect the compiler to either recognize the program from the C grammar, or reject it because it is not correct (hence the concept of "error"). Then run whatever guessing algorithm to report to you what may be wrong.

The programmer expectation is that the program must strictly conform to the C grammar, and errors are corrected. It is not silently producing a half-reconstructed program assuming what you meant to say.

masklinn 40 days ago

> What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say.

So what you meant is neither what you wrote, nor what you advocate for?

because in case you have forgotten here is what you advocate for:

> Pages that don't conform with the specification won't be rendered.

That is not how image rendering in browsers works. That is how XHTML does not work.

> Perhaps a better example is a C program being compiled into an executable.

It's not a better example, because it's a completely different and unrelated use case: C programs are usually not dynamically generated, and even when they are the person who compiles the code is usually either the person who wrote it or a person who has ways to fix it or report errors.

Not so when trying to read a random page on the web.

> I don't propose humans to write this format directly (although it should be human readable)

Approximately nobody wrote xhtml by hand, didn't save it.

This is also a nonsensical constraint-set on its face, there is no point to a human readable format which is not human writeable.

> The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Ah, an open and non-monopolizable format which can only be written via an official toolchain.

> Having a context-free grammar allows simple and fast parsing tools that can process your document in a similar way that you can query or manipulate a JSON^H^H^H^HXML file with tools like jq^H^Hxmlstarlet because the grammar is simple and strict.

None of which seems of any use to a format which pretends to human production and consumption. JSON is an interchange format between machines.

rodarima 40 days ago

> Pages that don't conform with the specification won't be rendered.

I agree on what I wrote here. They will fail with an error indicating where the mistake is so you can correct it (more likely the tool that produced it).

>> The objective is to enforce tools that make the transformation to produce a strictly conformant document.
>
> Ah, an open and non-monopolizable format which can only be written via an official toolchain.

??

The objective is that when you make a tool like markdown-to-foo, the output follows the spec. There is no mention of any "official toolchain".

> xmlstarlet

XML is strict. Try to find the same tool for HTML5, especially for transformations.

> JSON is an interchange format between machines.

That is pretty much what the specification would try to cover.

anthk 40 days ago

Instead of C, maybe, Lisp, or Forth without messing the stack.

eduction 40 days ago

> That... is not how anything happened.

What the heck are you talking about? User agent devs and users did indeed always go toward it mostly works.

masklinn 40 days ago

People didn't go towards "it mostly works", people go towards "it works at all". A lot of people tried to use xhtml, and it didn't work, broken content was pervasive and the experience when facing broken content was irredeemable.

tardedmeme 40 days ago

What was the exact nature of how devs found themselves unable to emit valid XML in all scenarios? What kind of bugs did they run into?

khimaros 39 days ago

as a person who just wants to publish a simple blog and informational articles, i would happily use this subset if it were still compatible with popular browsers. i've been praying for an effort like this to organize and am grateful that you are taking a stab at it. i would use dillo as my main testing browser if it was the browser that honored the subset spec most accurately and guaranteed compatibility with Chrome and Firefox.

chrismorgan 40 days ago

> There's a reason almost nobody serves XHTML with the application/xhtml+xml MIME type, and that reason is that getting a “parser error” (this is what browsers still do! try it!) is always worse than getting a page that 99% works.

That’s not the reason almost nobody serves XHTML.

The real reason is Internet Explorer. Okay, it’s a little more nuanced than that, but I think it’s accurate enough. Microsoft killed XHTML by inaction.

It’s 2004. XHTML is now a few years old, and all the rage. You decide to use it for your new project which you’re developing. At the start, you serve pages as application/xhtml+xml, and that works well in Firefox; but you know that won’t work because Internet Explorer still doesn’t support XHTML, and 90% of your viewers will be using that. So, a little frustrated, you serve your nice XHTML as text/html. You still validate it manually for a while, but then that habit disappears. Eventually you make one or two small mistakes that would have been caught easily if it were parsed as XML—but it’s not, because of Internet Explorer. Over time this disparity grows.

People have been complaining of the inefficacy of XHTML for this exact reason for two or three years by this point.

It’s 2006. XHTML is acknowledged to have failed. Everything else supports it, but as long as IE doesn’t, you can’t serve as application/xhtml+xml, and so you can’t get the advantages of XML syntax.

Seriously, early failure is good—so long as you’re working with it from the start. The problems only occur when you try to add strictness later.

Just look at typing in code bases. Adding strictness to existing JavaScript or Python or Ruby? Nightmare. Starting with static types? Somewhere between fine and extremely desirable.

(I might be overselling strictness’s popularity at the time—people don’t always like what’s good for them. We’ve largely realised now that unfettered dynamic typing is a bad idea, but ten years ago that was not settled. People get used to things. If IE had permitted XHTML early on, people would have got used to the idea of XHTML’s strictness and, I think, got to mostly like it.)

XHTML did not fail because of XML’s catastrophic parse failure mode. It failed because HTML already worked, and Internet Explorer took way too long to accept XHTML. If you’re forking the web and compatibility with existing documents is not a goal, you can’t use XHTML’s failure as an argument: it failed because of compatibility issues.

Well, Internet Explorer did eventually support application/xhtml+xml: in 2011, IE9. Way too late to matter. And so only by around 2015 or 2016 could you finally serve with XML syntax. And now why would you? For your system is big and has tiny errors here and there and your CMS just drops markup in and never got round to validating it and and and and so on. By that time, HTML had given up on the XML path, and although it worked, the momentum was entirely gone, so you’d run into difficulties due to inadequate documentation, inferior tooling (ironic), and various more.

pkasting 40 days ago

XHTML was never all the rage. Your premise is false.

Hard errors up front are great when you control the full content pipeline. It's very rare that that's the case, and was rare even in 2004. As soon as including someone else's broken content in your page prevents users from seeing your content, and that someone else can break the content at any time and you can't control it... few people will want hard errors.

tardedmeme 40 days ago

What if you don't output invalid XML? If you can manage a valid HTTP response then you can manage valid XML, can't you?

idle_zealot 40 days ago

> There's a reason almost nobody serves XHTML with the application/xhtml+xml MIME type, and that reason is that getting a “parser error” (this is what browsers still do! try it!)

In this brave new world we can try again. This time, though, when a parser error occurs we can spin up an Agent in the background to fix the document, looping until it passes the parser's validation, then display that! We can then have the browser automatically submit a PR or bug report to the website operator with the fix.

That way we can achieve well-defined wire formats with deterministic rendering behavior!

krapp 40 days ago

Having web documents not render in case of errors is already bad. But we already have "auto-correction" for that case - it's how HTML rendering already works in browsers.

Having an LLM hallucinate a new page in case of errors isn't a better solution, it's qualitatively worse. If you want web documents to render with errors, just use HTML.