HTML didn’t make sense to me until I realized it’s built on a state machine and its rules are based on what’s on the stack of open elements. For example, a number of tags trigger a rule to close open P elements or list items, and many end tags trigger a rule saying something like “close open elements until you’ve closed one with the same name as this tag.”

This, IMO, is a bigger reason to avoid regex and XML parsers for HTML documents. The rules aren’t apparent when thinking linearly about what strings appear after or before each other; they become clearer when thinking of HTML as a shorthand syntax for certain kinds of push and pop operations.
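To make the push/pop view concrete, here is a toy sketch in Python. It is not the real HTML tree-construction algorithm, which has many more rules (scopes, insertion modes, formatting elements). The `CLOSES_P` set and the `build_ops` function are made up for illustration, and the token format is an assumption.

```python
# Toy model only: a drastically simplified stack of open elements.

# Hypothetical, abbreviated set of start tags that close an open <p>.
CLOSES_P = {"p", "div", "ul", "ol", "h1", "h2", "table"}


def build_ops(tokens):
    """Turn (kind, name) tokens into a list of ("push"|"pop", name) operations."""
    stack = []  # the stack of open elements
    ops = []

    def pop():
        ops.append(("pop", stack.pop()))

    for kind, name in tokens:
        if kind == "start":
            # Some start tags implicitly close an open <p>.
            if name in CLOSES_P and "p" in stack:
                while stack[-1] != "p":
                    pop()
                pop()
            # A new <li> closes an open <li>, unless a list boundary
            # (<ul> or <ol>) sits between them on the stack.
            if name == "li":
                for i in range(len(stack) - 1, -1, -1):
                    if stack[i] in ("ul", "ol"):
                        break
                    if stack[i] == "li":
                        while len(stack) > i:
                            pop()
                        break
            stack.append(name)
            ops.append(("push", name))
        elif kind == "end":
            # "Close open elements until you've closed one with this name."
            # An end tag with no matching open element is ignored.
            if name in stack:
                while True:
                    closed = stack.pop()
                    ops.append(("pop", closed))
                    if closed == name:
                        break
    return ops


# Roughly: <ul><li><p>one<li>two</ul>
tokens = [
    ("start", "ul"),
    ("start", "li"),
    ("start", "p"),
    ("start", "li"),
    ("end", "ul"),
]
for op in build_ops(tokens):
    print(op)
```

The second `<li>` pops both the open `p` and the first `li` before it gets pushed, and the lone `</ul>` pops everything down to and including the `ul`. None of that is visible if you only look at which strings sit next to each other.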

XHTML is easier to parse, but for well-formed documents it pushes the complexity of invalid markup onto the rendering side. For example, a button nested inside a button is well-formed XML, so XHTML browsers render exactly that, even though it makes no sense from a UI perspective. Strange things happen when invalid markup arrives as well-formed XML.
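A quick way to see the well-formed vs. valid gap is Python's standard-library XML parser. This only shows what a generic XML parser accepts and says nothing about how any particular browser renders the result; the `parses_as_xml` helper and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Well-formed, but nonsensical as HTML: a button inside a button.
nested = "<form><button>outer <button>inner</button></button></form>"
root = ET.fromstring(nested)
outer = root[0]
inner = outer[0]
print(outer.tag, inner.tag)  # prints "button button"

# Ordinary HTML that relies on implied </p> tags is not well-formed XML.
unclosed = "<div><p>one<p>two</div>"
print(parses_as_xml(nested))    # prints "True"
print(parses_as_xml(unclosed))  # prints "False"
```

The XML parser has no idea that a `p` start tag should close an open `p`, and no opinion on whether a button belongs inside a button. It only checks that the tags nest.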