by ndriscoll 695 days ago
HTML does have a preprocessor. It's called XSLT, and it has includes, though they have no deferred fetch. Also, being a preprocessor, it can't be interacted with after page load (unless you use a JavaScript implementation). It's been built into browsers for 20+ years. It still works great, but browsers never supported versions past 1.0, so it shows its age some.

> though they have no deferred fetch

Also, at least back when we excised the last bits of it from our old codebase, there was no useful caching of either stylesheets or included resources (other stylesheets), so if you tried to mix client-side processing with HTTPS you were in for quite some pain unless you had a fast, very low-latency, uncongested link.

Currently it looks like at least Firefox and Chromium both cache stylesheets and included files as you'd expect. In fact, you can use this to increase cacheability in general. E.g. when this site is having performance issues, it often still works when you're logged out or when it's serving static versions of pages. It's easy to make every page static by including a `/myuser.xml` document in the XSL template and using that to get the current logged-in user and preferences to put on the page. That document can then be cached privately while the pages themselves are cached publicly. You can likewise include an `/item-details.xml?id=xxxx` document that provides the data needed to add the logged-in user's comment scores, votes, etc. to the page. If the included document fails to fetch, it falls back to being empty, and you get the static page (you could detect this and show a message).
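To make the public/private split concrete, here is a rough server-side sketch in Python of how the `Cache-Control` header might be chosen per path. The function name and the `max-age` values are my own assumptions for illustration, not anything this site actually does; only the two `.xml` paths come from the description above.

```python
# Paths that carry per-user data (from the description above).
PRIVATE_PREFIXES = ("/myuser.xml", "/item-details.xml")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a request path.

    Hypothetical helper: the lifetimes are illustrative assumptions.
    """
    if path.startswith(PRIVATE_PREFIXES):
        # Per-user documents: only the user's own browser may cache them.
        return "private, max-age=60"
    # Everything else is the same for every visitor, so shared caches
    # (CDN, reverse proxy) may store it.
    return "public, max-age=300"
```

With something like this, the expensive page render is shared by everyone, and only the small per-user documents need to hit the application for each visitor.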
XSLT is an XML transformation language, but HTML is not XML. Does XSLT work on regular HTML?
XSLT 3.0 can be directed to output HTML5 [0]. However, browsers only implement XSLT 1.0, and as far as I am aware there is no open-source XSLT 3.0 implementation.

Still, it's possible with XSLT 1.0 to produce documents in the common subset of XML and HTML5 ("XHTML5"). It can't produce the usual <!DOCTYPE html> at the top of the document, but it can produce the alternative <!DOCTYPE html SYSTEM "about:legacy-compat">.

On the input side, every XSLT version only accepts valid XML, as far as I am aware.

[0] https://www.w3.org/TR/xslt-xquery-serialization-30/#html-out...

`xsltproc --html` is an example of HTML input (probably HTML4 parsing rules though?) if you really need it. This is an XSLT 1.0 processor, wrapping libxslt which most browsers use.

As for output, the difference is largely irrelevant for browser purposes since they just want a tree.

I'm not sure how many extensions the browsers allow, but a major part of the reason XSLT 2/3 failed to take off is that libxslt already provides most of the useful features from newer versions as extensions (many via EXSLT-namespaced modules, at least partially supported in browsers; see MDN). What it doesn't do is implement the unnecessary complexity that the Java world loves.

At one time HTML was recast from SGML to XML: https://en.wikipedia.org/wiki/XHTML so if you authored XHTML, you could run XSLT on it. There is also XHTML5, an XML serialization of HTML5. I imagine that in the real world a great deal of the web is HTML that browsers accept but that is not XML.
As far as I know, HTML5 has diverged from its origins enough that it's neither SGML nor XML. However, given the existence of XHTML5, it might be possible to parse an HTML5 DOM and re-serialize it as XHTML5, and thus it might be possible to take parseable HTML as input to XSLT, albeit with some indirection.
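As a rough illustration of that indirection, here is a minimal Python sketch that reads HTML with the standard library's `html.parser` and re-serializes it as well-formed XML. Big caveats: `html.parser` is a tokenizer, not an implementation of the HTML5 parsing algorithm, so this does not apply rules like implied end tags; it also drops comments and the doctype, and would emit invalid XML for duplicate attributes. The class and function names are made up for the example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML void elements: they never have an end tag, so emit them self-closed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XhtmlSerializer(HTMLParser):
    """Collects an XML-style serialization of the HTML tokens it is fed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. `disabled`) get their own name as value.
        rendered = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags and end tags for void elements.
        if tag in VOID or tag not in self.open_tags:
            return
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        # Character references were already decoded; re-escape for XML.
        self.out.append(escape(data))


def html_to_xhtml(html: str) -> str:
    serializer = XhtmlSerializer()
    serializer.feed(html)
    serializer.close()
    # Close whatever is still open at end of input.
    while serializer.open_tags:
        serializer.out.append(f"</{serializer.open_tags.pop()}>")
    return "".join(serializer.out)
```

For real pages you'd want a proper HTML5 parser in front of this, but the shape is the same: parse leniently, serialize strictly, then hand the result to XSLT.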
We were going to move on to XHTML after HTML4; for those variants (did it go beyond XHTML 1.1?) HTML is XML-compliant. That got caught in slow design-by-committee hell, though, so HTML5 became the de facto standard instead. There is XHTML5, which is an attempt to direct that back towards compliance, but I've never seen it used in the wild.
I'm not sure what you mean, but you can output HTML that a browser will be happy with, and that conforms to the spec. See e.g. https://stackoverflow.com/questions/3387127/set-html5-doctyp...
As I understand XSLT, it takes an XML document as input and an XML document describing the transformation, and produces an XML document as output.

But most HTML in the wild today is not valid XML. There is XHTML as mentioned by a sibling comment but it's rarely used. So if you were to start with an existing base of HTML documents, you couldn't easily add XSLT preprocessing to them. The issue is with the input rather than the output.

The fastest way to confirm that a given HTML document is not valid XML is to change the HTTP Content-Type from "text/html" to "application/xhtml+xml".
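If you'd rather check locally than change a header, something like the following Python sketch gives a similar answer using the standard library's XML parser. It is only an approximation of what a browser does with "application/xhtml+xml" (details such as named entities may differ), and the function name is my own.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

Typical HTML habits such as an unclosed `<br>` make this return False, which is the same class of error a browser would report in XHTML mode.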

Here is what I know about using XHTML in practice: https://www.nayuki.io/page/practical-guide-to-xhtml

If you're using it as a template language for your own pages, you can of course just write it correctly (this is no different from needing to use correct syntax for React code to compile).

If you have someone else's documents, or need to mass convert your own to fix them, there's HTML tidy[0]. This one is quite useful to be able to run XML processing CLI tools on scraped web pages.

But the real power is in delivering XML to the client, not HTML. This lets you work in the domain model directly on the frontend, and use XSLT to transform that into HTML for display. So of course you'd use well-formed XML in that case.

Imagine if you didn't have a distinction between APIs and pages; you just returned the data along with a link to a template that says how to display it and (ideally) a link to a schema definition. Modifying templates for components could be as easy as modifying CSS attributes in the browser console, giving the next generation an easy way to peek and play around with how it all works as they're growing up. We were so close, and then things veered off into the hellworld that is the modern web.
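For a sense of what "data plus a link to a template" looks like on the wire, here is a small Python sketch of a handler body that returns domain XML with an `xml-stylesheet` processing instruction pointing at the template. The element names, the function name, and the `/item.xsl` path are invented for the example, and it assumes the stylesheet href is a trusted constant (it is not escaped).

```python
import xml.etree.ElementTree as ET


def render_item_xml(item_id: int, title: str, score: int,
                    stylesheet_href: str = "/item.xsl") -> str:
    """Serialize an item as XML that tells the browser which XSLT to apply."""
    # The domain data, with no presentation markup at all.
    root = ET.Element("item", {"id": str(item_id)})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "score").text = str(score)
    body = ET.tostring(root, encoding="unicode")

    # The processing instruction is the "link to a template": a browser that
    # loads this document fetches the stylesheet and renders the result.
    pi = f'<?xml-stylesheet type="text/xsl" href="{stylesheet_href}"?>'
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{pi}\n{body}'
```

The same response serves as the API (any XML client can read it) and as the page (a browser applies the template), which is the whole point.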

[0] https://github.com/htacg/tidy-html5