by mort96 310 days ago
I'm not sure it's possible to have a technology that's user-facing with multiple competing implementations, and not also, in some way, "liberal in what it accepts".

Back when XHTML was somewhat hype and there were sites which actually used it, I recall being met with a big fat "XML parse error" page on occasion. If XHTML really took off (as in a significant majority of web pages were XHTML), those XML parse error pages would become way more common, simply because developers sometimes write bugs and many websites are server-generated with dynamic content. I'm 100% convinced that some browser would decide to implement special rules in their XML parser to try to recover from errors. And then, that browser would have a significant advantage in the market; users would start to notice, "sites which give me an XML Parse Error in Firefox work well in Chrome, so I'll switch to Chrome". And there you have the exact same problem as HTML, even though the standard itself is strict.

The magical thing of HTML is that they managed to make a standard, HTML 5, which incorporates most of the special case rules as implemented by browsers. As such, all browsers would be lenient, but they'd all be lenient in the same way. A strict standard which mandates e.g "the document MUST be valid XML" results in implementations which are lenient, but they're lenient in different ways.

HTML should arguably have been specified to be lenient from the start. Making a lenient standard from scratch is probably easier than trying to standardize commonalities between many differently-lenient implementations of a strict standard like what HTML had to do.
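To make the strict-versus-lenient contrast concrete, here is a minimal sketch using only Python's standard library. It assumes `xml.etree.ElementTree` as a stand-in for a browser's strict XML parser and `html.parser.HTMLParser` as a stand-in for a forgiving one. Note that `HTMLParser` is a tolerant tokenizer, not an implementation of the HTML 5 parsing algorithm, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: the <b> element is never closed.
MALFORMED = "<p>Hello <b>world</p>"
WELL_FORMED = "<p>Hello <b>world</b></p>"


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # This is the moral equivalent of the "XML parse error" page.
        return False


class TagCollector(HTMLParser):
    """Records start tags and does not raise on mismatched tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_tags(markup):
    """Feed markup to the tolerant parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

The strict parser rejects the malformed sample outright, while the tolerant one keeps going and reports both tags. What each lenient parser does next with the unclosed `<b>` is exactly the part that differed between browsers until it was standardized.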

Are you aware of HTML 5? Fun fact about it: there's zero leniency in it. Instead, it specifies a precise semantics (in terms of parse tree) for every byte sequence. Your parser either produces correct output or is wrong. This is the logical end point of being lenient in what you accept - eventually you just standardize everything so there is no room for an implementation to differ on.

The only difference between that and not being lenient in the first place is a whole lot more complex logic in the specification.

> Are you aware of HTML 5? Fun fact about it: there's zero leniency in it.

I think you understand what I mean. Every byte sequence has a

> The only difference between that and not being lenient in the first place is a whole lot more complex logic in the specification.

Not being lenient is how HTML started out.

History has gone the way it went & we have HTML now, there's not much point harking back, but I still find it very odd that people today - with the wisdom of hindsight - believe that the world opting for HTML & abandoning XHTML was the sensible choice. It seems odd to me that it's not seen as one of those "worse winning out" stories in the history of technology, like Betamax.

The main argument about XHTML not being "lenient" always centred around client UX of error display - Chrome even went on to actually implement a user-friendly partial-parse/partial-render handling of XHTML files that literally solved everyone's complaints via UI design without any spec changes but by this stage it was already too late.

The whole story of why we went with HTML is somewhat hilarious: 1 guy wrote an ill-informed blog post bitching about XHTML, generated a lot of hype, made zero concrete proposals to solve its problems, & then somehow convinced major browser makers (his current & former employers) to form an undemocratic rival group to the W3C, in which he was appointed dictator. An absolutely bizarre story for the ages. I do wish it was documented better, but alas most of the resources around it were random dev blogs that link rotted.

>The whole story of

Is that really the story? I think it was more like "backward compatible solution won out over a more pure, theoretically better solution".

There's an enormous non-xhtml legacy that nobody wanted to port. And tooling back in the day didn't make it easy to write correct xhtml.

Also, like it or not, HTML is still written by humans sometimes, and they don't like the parser blowing up because of a minor problem. Especially since such problems are often detected late, and a page which displays slightly wrong is a much better outcome than the page blowing up.

More or less? https://en.m.wikipedia.org/wiki/WHATWG is fairly neutral. As someone in userland at the time on the other side of it, it was all a bit nuts.

I.e. we got new standards invented out of thin air - https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Mic... - which ignored what hundreds had worked on before, and which seemed to be driven by one person controlling the "standard" and making it up as they went along.

Microformats and RDFa were the more widely adopted solutions at the time, had a lot of design and thought put into them, worked with HTML4 (but thrived if used with xhtml), etc etc.

JSON-LD/schema.org has now filled the niche and arguably it's a lot better for devs, but imagine how much better the "AI web UX" would be now if we'd just standardised earlier on one and stuck with it for those years?

This is the main area where I saw the behaviour on display, where I interacted most. So the original comment feels absolutely in line with my recollections.

I love bits of HTML5, but the way it congealed into reality isn't one of them.

> There's an enormous non-xhtml legacy that nobody wanted to port.

This is a fair argument if content types were being enforced but XML parsing was opt-in (for precisely this reason).

> And tooling back in the day didn't make it easy to write correct xhtml.

True. And instead of developing such tooling, we decided to boil the ocean to get to the point where tooling today doesn't make it any easier to lint / verify / validate your HTML. Mainly because writing such tooling to a non-strict target like HTML is a million times harder than to a target with strict syntax.

A nice ideal would've been IDEs & CI with strict XHTML parsers & clients with fallbacks (e.g. what Chromium eventually implemented)

> I recall being met with a big fat "XML parse error" page on occasion. If XHTML really took off (as in a significant majority of web pages were XHTML), those XML parse error pages would become way more common

Except JSX is being used now all over the place and JSX is basically the return of XHTML! JSX is an XML schema with inline JavaScript.

The difference nowadays is all in the tooling. It is either precompiled (so the devs see the error) or generated on the backend by a proper library and not someone YOLOing PHP to super glue strings together, as per how dynamic pages were generated in the glory days of XHTML.

We basically got full circle back to XHTML, but with a lot more complications and a worse user experience!

Speaking of PHP, XHP predates JSX I believe:

https://en.wikipedia.org/wiki/XHP

Same idea: an XML-like syntax for creating object trees typically used to model HTML.

Nobody is generating JSX dynamically.

Not directly as strings of course, but a for loop that outputs a bunch of JSX components based on the array return values from a DB fetch is dynamically generated JSX.

No, it's not. The JSX, as in the text in the source file, is static. You can't accidentally forget to escape a string from the database and therefore end up with invalid JSX syntax, like you can when dynamically generating HTML. You're dynamically generating virtual DOM nodes, but the JSX is static.

You are saying the exact same thing I am, just with different words.

JSX makes it impossible to crap out invalid HTML because it is a library + toolchain (+ entire ecosystem) that keeps it from happening. JSX is always checked for validity before it gets close to the user, so the most irritating failure case of XHTML just doesn't happen.

XHTML never had that benefit. My point is that if libraries like React or Vue had been popular at the time of XHTML, then XHTML's strictness wouldn't have been an issue, because JSX always generates valid output (well, ignoring compiler bugs, which I hit far too damn many of early on in React's life).
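The "tree, not strings" idea isn't specific to JSX. As a rough sketch, assuming Python's standard `xml.etree.ElementTree` in place of a JSX toolchain (the `render_list` helper is hypothetical), building nodes from data and serializing once at the end means there is no point at which an unescaped value can break the syntax:

```python
import xml.etree.ElementTree as ET


def render_list(items):
    """Build a <ul> from arbitrary strings and serialize it.

    The values are attached as text nodes, so the serializer escapes
    them; nothing here concatenates markup by hand.
    """
    ul = ET.Element("ul")
    for item in items:
        li = ET.SubElement(ul, "li")
        li.text = item  # raw data, escaped only at serialization time
    return ET.tostring(ul, encoding="unicode")


# Values that would break naive string templating.
markup = render_list(["a < b", "Tom & Jerry"])

# Round trip: the strict parser accepts the output and the data survives.
parsed = ET.fromstring(markup)
recovered = [li.text for li in parsed]
```

The loop is as "dynamic" as a `.map()` over DB rows in JSX, but the output is well-formed by construction.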

> If XHTML really took off (as in a significant majority of web pages were XHTML), those XML parse error pages would become way more common

This is not true because you are imagining a world with strict parsing but where people are still acting as though they have lax parsing. In reality, strict parsing changes the incentives and thus people’s behaviour.

This is really easy to demonstrate: we already have a world with strict parsing for everything else. If you make a syntax error in JSON, it stops dead. How often is it that you run into a website that fails to load because there is a syntax error in JSON? It’s super rare, right? Why is that? It’s because syntax errors are fatal errors. This means that when developing the site, if the developer makes a syntax error in JSON, they are confronted with it immediately. It won’t even load in their development environment. They can’t run the code and the new change can’t be worked on until the syntax error is resolved, so they do that.
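For illustration, this is what "it stops dead" looks like with Python's standard `json` module. A small sketch; the `try_parse` helper and the sample inputs are made up for the example:

```python
import json


def try_parse(text):
    """Return (value, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # No recovery is attempted: a single stray character is fatal.
        return None, f"line {err.lineno}, column {err.colno}: {err.msg}"


good_value, good_error = try_parse('{"a": 1}')
bad_value, bad_error = try_parse('{"a": 1,}')  # trailing comma
```

The trailing comma yields no partial result at all, which is why the developer notices it long before anything ships.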

In your hypothetical world, they are making that syntax error… and just deploying it anyway. This makes no sense. You changed the initial condition, but you failed to account for everything that changes downstream of that. If syntax errors are fatal errors, you would expect to see far, far fewer syntax errors because it would be way more difficult for a bug like that to be put into production.

We have strict syntax almost everywhere. How often do you see a Python syntax error in the backend code? How often do you run across an SVG that fails to load because of a syntax error? HTML is the odd one out here, and it’s very clear that Postel was wrong:

https://datatracker.ietf.org/doc/rfc9413/

> This is not true because you are imagining a world with strict parsing but where people are still acting as though they have lax parsing. In reality, strict parsing changes the incentives and thus people’s behaviour.

Dude I lived in that world. A fair amount of developers explicitly opted into strict parsing rules by choosing to serve XHTML. And yet, those developers who opted into strict parsing messed up their XML generation frequently enough that I, as an end user, was presented with that "XML Parse Error" page on occasion. I don't understand why you'd think all developers would stop messing up if only strict parsing was foisted upon everyone rather than only those who explicitly opt in.

> In your hypothetical world, they are making that syntax error… and just deploying it anyway.

No, they're not. In my (non-hypothetical, actually experienced in real life) world of somewhat wide-spread XHTML, I'm assuming that developers would make sites which appeared to work with their test content, but would produce invalid XML in certain situations with some combination of dynamic content or other conditions. Forgetting to escape user content is the obvious case, but there are many ways to screw up HTML/XHTML generation in ways which appear to work during testing.
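A minimal sketch of that failure mode, assuming a string template and Python's standard XML parser as the strict consumer (the template and helper names are hypothetical). The page is well-formed for the developer's test content and only breaks when a user types an ampersand:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical server-side template for a comment.
TEMPLATE = "<p>Comment: {body}</p>"


def render_unescaped(body):
    """The buggy version: user content is pasted straight into the markup."""
    return TEMPLATE.format(body=body)


def render_escaped(body):
    """The fixed version: &, < and > in user content are escaped first."""
    return TEMPLATE.format(body=escape(body))


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

`render_unescaped("nice post")` passes every test the developer is likely to write; `render_unescaped("Tom & Jerry")` is the one that produces the parse error page in production.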

> We have strict syntax almost everywhere. How often do you see a Python syntax error in the backend code?

Never, but people don't dynamically generate their Python back-end code based on user content.

> How often do you run across an SVG that fails to load because of a syntax error?

Never, but people don't typically dynamically generate their SVGs based on user content. Almost all SVGs out there are served as static assets.

> Dude I lived in that world. A fair amount of developers explicitly opted into strict parsing rules by choosing to serve XHTML.

No they didn’t, unless you and I have wildly different definitions of “a fair amount”. The developers who did that were an extreme minority because Internet Explorer, which had >90% market share, didn’t support application/xhtml+xml. It was a curiosity, not something people actually did in non-negligible numbers.

And you’re repeating the mistake I explicitly called out. Opting into XHTML parsing does not transport you to a world in which the rest of the world is acting as if you are in a strict parsing world. If you are writing, say, PHP, then that language was still designed for a world with lax HTML parsing no matter how you serve your XHTML. There is far more to the world than just your code and the browser. A world designed for lax parsing is going to be very different to a world designed for strict parsing up and down the stack, not just your code and the browser.

> I'm assuming that developers would make sites which appeared to work with their test content, but would produce invalid XML in certain situations with some combination of dynamic content or other conditions. Forgetting to escape user content is the obvious case, but there are many ways to screw up HTML/XHTML generation in ways which appear to work during testing.

Again, you are still making the same mistake of forgetting to consider the second-order effects.

In a world where parsing is strict, a toolchain that produces malformed syntax has a show-stopping bug and would not be considered reliable enough to use. The only reason those kinds of bugs are tolerated is because parsing is lax. Where is all the JSON-generating code that fails to escape values properly? It is super rare because those kinds of problems aren’t tolerated because JSON has strict parsing.
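As a sketch of why badly escaped JSON is rare, compare hand-built JSON with what a serializer produces. This uses Python's standard `json` module, and the two helper functions are hypothetical:

```python
import json


def naive_json(name):
    """Glue strings together, the way HTML was often generated."""
    return '{"name": "' + name + '"}'


def safe_json(name):
    """Let the serializer handle quoting and escaping."""
    return json.dumps({"name": name})


def parses(text):
    """Return True if the text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


tricky = 'Bobby "Tables"'
```

The glued version works for plain names and fails as soon as the value contains a quote, while the serialized version round-trips. Because the failure is fatal, almost everyone ends up using the serializer.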

> No they didn’t, unless you and I have wildly different definitions of “a fair amount”. The developers who did that were an extreme minority because Internet Explorer, which had >90% market share, didn’t support application/xhtml+xml. It was a curiosity, not something people actually did in non-negligible numbers.

Despite being an extreme minority of strict parsing enthusiasts who decided to explicitly opt into strict parsing, they still messed up enough for me to occasionally have encountered "XML Parse Error" pages. You'd think that if anyone managed to correctly generate strict XHTML, it'd be those people.

> You'd think that if anyone managed to correctly generate strict XHTML, it'd be those people.

Once more, they were operating in a world designed for lax parsing. Even if their direct choices were for strict parsing, everything surrounding them was lax.

Somebody making the choice to enable strict parsing in a world designed for lax parsing is a fundamentally different scenario than “If XHTML really took off (as in a significant majority of web pages were XHTML)”, where the entire technology stack from top to bottom would be built assuming strict parsing.

> Never, but people don't dynamically generate their Python back-end code based on user content.

Perhaps not much in the past - but I suspect with Agentic systems a lot more in the future - are you suggesting relaxing the Python syntax to make it easier for auto-generated code to 'run'?

There's a difference between static chatbot-generated code that you commit to your VCS, and dynamic code generated based on user content. I'm not talking about the former case.

> I'm not talking about chatbot generated code that's committed to your VCS.

I'm talking about code that's dynamically generated in response to user input in order to perform the task the user specified. This is happening today in agentic systems.

Should we relax python syntax checking to work around the poorly generated code?

I remember "HTML 5 W3C Valid" buttons, proudly displayed by web pages. This was considered cool, so why wouldn't doing the same for XHTML be just as cool?

lol CVE-2020-26870