by JimDabell 309 days ago
> If XHTML really took off (as in a significant majority of web pages were XHTML), those XML parse error pages would become way more common

This is not true because you are imagining a world with strict parsing but where people are still acting as though they have lax parsing. In reality, strict parsing changes the incentives and thus people’s behaviour.

This is really easy to demonstrate: we already have a world with strict parsing for everything else. If you make a syntax error in JSON, it stops dead. How often is it that you run into a website that fails to load because there is a syntax error in JSON? It’s super rare, right? Why is that? It’s because syntax errors are fatal errors. This means that when developing the site, if the developer makes a syntax error in JSON, they are confronted with it immediately. It won’t even load in their development environment. They can’t run the code and the new change can’t be worked on until the syntax error is resolved, so they do that.
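To make the "stops dead" behaviour concrete, here is a minimal sketch using Python's standard `json` module. The helper name `try_parse` and the sample documents are made up for illustration; the point is only that a single stray comma rejects the whole document rather than being silently repaired.

```python
import json


def try_parse(text):
    """Return (True, parsed_value) or (False, error description)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # A strict parser reports where it gave up; nothing is recovered.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"


valid = '{"title": "Hello", "tags": ["a", "b"]}'
broken = '{"title": "Hello", "tags": ["a", "b",]}'  # trailing comma

print(try_parse(valid))
print(try_parse(broken))
```

The developer sees the failure on the very first load, which is the feedback loop being described.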

In your hypothetical world, they are making that syntax error… and just deploying it anyway. This makes no sense. You changed the initial condition, but you failed to account for everything that changes downstream of that. If syntax errors are fatal errors, you would expect to see far, far fewer syntax errors because it would be way more difficult for a bug like that to be put into production.

We have strict syntax almost everywhere. How often do you see a Python syntax error in the backend code? How often do you run across an SVG that fails to load because of a syntax error? HTML is the odd one out here, and it’s very clear that Postel was wrong:

https://datatracker.ietf.org/doc/rfc9413/

> This is not true because you are imagining a world with strict parsing but where people are still acting as though they have lax parsing. In reality, strict parsing changes the incentives and thus people’s behaviour.

Dude I lived in that world. A fair amount of developers explicitly opted into strict parsing rules by choosing to serve XHTML. And yet, those developers who opted into strict parsing messed up their XML generation frequently enough that I, as an end user, was presented with that "XML Parse Error" page on occasion. I don't understand why you'd think all developers would stop messing up if only strict parsing was foisted upon everyone rather than only those who explicitly opt in.

> In your hypothetical world, they are making that syntax error… and just deploying it anyway.

No, they're not. In my (non-hypothetical, actually experienced in real life) world of somewhat wide-spread XHTML, I'm assuming that developers would make sites which appeared to work with their test content, but would produce invalid XML in certain situations with some combination of dynamic content or other conditions. Forgetting to escape user content is the obvious case, but there are many ways to screw up HTML/XHTML generation in ways which appear to work during testing.
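A minimal sketch of that failure mode, assuming a template that pastes a comment straight into markup. The function names and sample strings are hypothetical, and Python's standard-library XML parser stands in for a browser's strict XHTML parser.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render_naive(comment):
    # Pastes user content into the markup without escaping it.
    return f"<p>{comment}</p>"


def render_escaped(comment):
    # Escapes &, < and > before inserting the content.
    return f"<p>{escape(comment)}</p>"


def is_well_formed(markup):
    """True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


test_content = "Nice post"
user_content = "Tom & Jerry <3"

print(is_well_formed(render_naive(test_content)))    # passes during testing
print(is_well_formed(render_naive(user_content)))    # breaks on real input
print(is_well_formed(render_escaped(user_content)))  # escaping fixes it
```

The naive renderer looks fine with the test content and only fails once a user types an ampersand or an angle bracket.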

> We have strict syntax almost everywhere. How often do you see a Python syntax error in the backend code?

Never, but people don't dynamically generate their Python back-end code based on user content.

> How often do you run across an SVG that fails to load because of a syntax error?

Never, but people don't typically dynamically generate their SVGs based on user content. Almost all SVGs out there are served as static assets.

> Dude I lived in that world. A fair amount of developers explicitly opted into strict parsing rules by choosing to serve XHTML.

No they didn’t, unless you and I have wildly different definitions of “a fair amount”. The developers who did that were an extreme minority because Internet Explorer, which had >90% market share, didn’t support application/xhtml+xml. It was a curiosity, not something people actually did in non-negligible numbers.

And you’re repeating the mistake I explicitly called out. Opting into XHTML parsing does not transport you to a world in which the rest of the world is acting as if you are in a strict parsing world. If you are writing, say, PHP, then that language was still designed for a world with lax HTML parsing no matter how you serve your XHTML. There is far more to the world than just your code and the browser. A world designed for lax parsing is going to be very different to a world designed for strict parsing up and down the stack, not just your code and the browser.

> I'm assuming that developers would make sites which appeared to work with their test content, but would produce invalid XML in certain situations with some combination of dynamic content or other conditions. Forgetting to escape user content is the obvious case, but there are many ways to screw up HTML/XHTML generation in ways which appear to work during testing.

Again, you are still making the same mistake of forgetting to consider the second-order effects.

In a world where parsing is strict, a toolchain that produces malformed syntax has a show-stopping bug and would not be considered reliable enough to use. The only reason those kinds of bugs are tolerated is because parsing is lax. Where is all the JSON-generating code that fails to escape values properly? It is super rare because those kinds of problems aren’t tolerated because JSON has strict parsing.
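One way to read the JSON comparison: hand-concatenated output breaks on the first awkward value, while output produced by a serializer cannot be malformed in that way. The sketch below uses Python's standard `json` module; the builder functions and the sample name are illustrative assumptions, not anything from a real codebase.

```python
import json


def build_naive(name):
    # String concatenation: a quote in the value corrupts the document.
    return '{"name": "' + name + '"}'


def build_with_serializer(name):
    # The serializer owns the escaping, so the output is always valid JSON.
    return json.dumps({"name": name})


def parses(text):
    """True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


name = 'Dwayne "The Rock" Johnson'

print(parses(build_naive(name)))
print(parses(build_with_serializer(name)))
print(json.loads(build_with_serializer(name))["name"] == name)
```

Because the naive version fails loudly on the first value with a quote in it, there is strong pressure to build JSON through a serializer instead.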

> No they didn’t, unless you and I have wildly different definitions of “a fair amount”. The developers who did that were an extreme minority because Internet Explorer, which had >90% market share, didn’t support application/xhtml+xml. It was a curiosity, not something people actually did in non-negligible numbers.

Despite being an extreme minority of strict parsing enthusiasts who decided to explicitly opt into strict parsing, they still messed up enough for me to occasionally have encountered "XML Parse Error" pages. You'd think that if anyone managed to correctly generate strict XHTML, it'd be those people.

> You'd think that if anyone managed to correctly generate strict XHTML, it'd be those people.

Once more, they were operating in a world designed for lax parsing. Even if their direct choices were for strict parsing, everything surrounding them was lax.

Somebody making the choice to enable strict parsing in a world designed for lax parsing is a fundamentally different scenario than “If XHTML really took off (as in a significant majority of web pages were XHTML)”, where the entire technology stack from top to bottom would be built assuming strict parsing.

> Never, but people don't dynamically generate their Python back-end code based on user content.

Perhaps not much in the past, but I suspect a lot more in the future with agentic systems. Are you suggesting relaxing the Python syntax to make it easier for auto-generated code to 'run'?

There's a difference between static chatbot-generated code that you commit to your VCS, and dynamic code generated based on user content. I'm not talking about the former case.

> I'm not talking about chatbot generated code that's committed to your VCS.

I'm talking about code that's dynamically generated in response to user input in order to perform the task the user specified. This is happening today in agentic systems.

Should we relax Python syntax checking to work around the poorly generated code?