by JimDabell 3427 days ago
You sometimes need more control over the actual HTML document than that; for instance to work around browser bugs or for efficiency. But if you are only interested in the semantics, then it's still not an adequate representation of the document. How would you, for instance, add an attribute to the body element? If you're dealing with a semantic representation like a DOM library would give you, then this would be trivial, because the body element would be part of the model you are working with. But the body element doesn't exist in that S-expression. You'll have to manually insert it, which involves further domain-specific knowledge embedded in your code.

Basically, it's stuck in-between two states doing neither correctly. It doesn't represent the actual HTML document, and it doesn't represent the parsed document structure. It's an alternative model of the HTML document that serialises to something that would be parsed in an equivalent way. I'm sure that's useful in a whole bunch of different situations, but it's not as simple as "S-expressions can do everything HTML can, in a convenient way".

S-expressions are great, and very useful. But they aren't the right tool for every situation. HTML is an odd markup language that only appears simple superficially, with all kinds of irregular corner cases creeping in when you dig into the details. S-expressions would be a great fit if HTML were as simple as it appears on the surface, but it's not.


I agree, and I'd also like to add that I find general discussions about s-expr vs markup (as well as JSON vs XML years ago) pointless.

Markup is meant as a text format for content authors that can be parsed into a hierarchical structure, rather than as general-purpose data representation syntax, even though XML is being frequently (ab-)used for this purpose.

The original use case for markup is that you can take a piece of plain text and then mark it up with tags, unlike s-expr and/or JSON, which arise out of the syntax of a programming language and need, e.g., verbatim text to be written as string constants with quotation characters.

> The original use case for markup is that you can take a piece of plain text and then mark it up with tags

Yes, that was the original use case, but in actual practice HTML has not been used that way for a long time. Nowadays HTML is de facto used as a programming language for the visual representation layer of a browser. No one actually uses HTML to mark up documents by hand any more, for two reasons: first, no one writes plain text documents to use as source material for markup. They write Word documents, or TeX documents, but plain text source is all but unheard of nowadays. And second, HTML syntax is too clumsy and places too many demands on the user. So when ordinary people want to produce HTML they use WYSIWYG editors. When geeks want to produce HTML (and remember I'm talking about documents here) they use markdown. The only time anyone writes HTML nowadays is when they want to make a browser do something fancy.

You are confusing syntax and semantics. HTML and the DOM are two different things. HTML is a string of characters (syntax). The DOM is a data structure (semantics). Normally a DOM is produced by parsing HTML, but it can be produced in other ways (by running Javascript code, for example).

S-expressions are a data structure, different from the DOM, but S-expression syntax is a syntax. Normally S-expression syntax is parsed to produce S-expressions, but it can also be parsed to produce other things. S-expression syntax can be parsed to produce a DOM. The easiest way to do this is to parse S-expression syntax into S-expressions, render those S-expressions into HTML code, and then use an off-the-shelf HTML parser to parse the HTML. But you could also write a parser that parsed S-expression syntax directly into a DOM if you wanted to. You could also write a transformation program that compiled S-expressions directly into a DOM without going through the intermediate HTML.
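
The "render those S-expressions into HTML code" step can be sketched in a few lines. This is only an illustration in Python, not CL-WHO or the commenter's library: nested tuples stand in for S-expressions, and the names `render` and `VOID` are hypothetical.

```python
from html import escape

# Elements that never take a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}

def render(node):
    """Render a nested-tuple tree to an HTML string.

    A node is either a string (text) or a tuple of the form
    (tag, optional-attribute-dict, *children).
    """
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, *rest = node
    attrs = {}
    if rest and isinstance(rest[0], dict):
        attrs, *rest = rest
    attr_text = "".join(
        f' {name}="{escape(str(value), quote=True)}"'
        for name, value in attrs.items()
    )
    if tag in VOID:
        return f"<{tag}{attr_text}>"
    children = "".join(render(child) for child in rest)
    return f"<{tag}{attr_text}>{children}</{tag}>"

print(render(("p", {"class": "x"}, "a < b", ("br",), ("em", "hi"))))
```

Because the tree is ordinary data, adding an attribute is a plain data-structure edit made before rendering; the string of HTML only exists at the very end.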

The answer to your question of how to add an attribute to an implied element is that it is not possible to do that in HTML. It is only possible to add an attribute to an implicit element of the DOM produced by parsing an HTML document that omits that element (because at that point the element is no longer implicit). The exact same thing is possible using S-expressions. For example, here's how you write tables in my library:

(:table (header header ...) (data data ...) (data data ...))

This string of characters is parsed by the Lisp reader to produce an S-expression that has a one-to-one correspondence with the string you see above. But then there is an extra processing step that transforms that into a different S-expression whose printed representation is:

(:table (:tr (:th header) (:th header) ...) (:tr (:td data) (:td data) ...) ...)

At that point you can manipulate that S-expression in the same way that you manipulate the DOM (because they are both just data structures). Once you're done, you convert the S-expression to a DOM. At the moment that is done by rendering to HTML, but as I noted above that is just an implementational convenience to take advantage of the fact that HTML->DOM parsers are available off the shelf. You don't have to do it that way (and indeed the world would be a better place if it were not done that way).
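
The extra processing step described above, turning the compact table form into explicit rows and cells, might look roughly like this. It is a Python analogue under assumed conventions (tuples for lists, plain strings for tags), and `expand_table` is a hypothetical name, not a function from the library being discussed.

```python
def expand_table(sexpr):
    """Expand ("table", header_row, *data_rows) into explicit tr/th/td nodes."""
    head, header_row, *data_rows = sexpr
    if head != "table":
        raise ValueError("expected a table form")
    # The first row becomes header cells, every later row becomes data cells.
    rows = [("tr", *(("th", cell) for cell in header_row))]
    rows += [("tr", *(("td", cell) for cell in row)) for row in data_rows]
    return ("table", *rows)

compact = ("table", ("name", "qty"), ("apple", "3"), ("pear", "5"))
print(expand_table(compact))
```

The result is still just a nested tuple, so it can be inspected or modified further before anything is serialised.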

All of this is trivial when dealing with S-expressions precisely because of the strict 1-to-1 correspondence between data structure and visual representation that does not exist in SGML-derived languages. That is why writing code for SGML-derived languages using S-expression syntax is so advantageous. (Actually, this is true for any language, not just SGML-derived languages. It's just a little more obvious for SGML-derived languages because SGML syntax already kinda sorta looks like a data structure representation so it's a little easier to grasp what is going on.)

> HTML is a string of characters (syntax). The DOM is a data structure (semantics). [...] S-expressions are a data structure, different from the DOM, but S-expression syntax is a syntax.

I believe this is where the confusion is coming from. When you parse HTML syntax, you get a data structure; this is the same as when you read sexpr syntax, you also get a data structure. Both these data structures are different from the DOM tree.

Try this example:

    <pre>
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    </pre>
Can CL-WHO generate HTML that matches that? (i.e. feed both into a tool like BeautifulSoup and produce the same data structure?)

Outside of CL-WHO and Hiccup-type libraries, you can of course use S-exprs to represent the same data structure. Here's a hypothetical S-expr syntax that might produce the same data structure:

    ((pre)
      "\n  " (span) "one\n  " (/span)
      "\n  " (br)
      "\n  " (span) "two" (/span)
      "\n  " (br/) "\n"
     (/pre))
Which is what I believe JimDabell meant by:

> you can't represent all valid HTML documents as S-expressions, at least not in the convenient way people assume

> Both these data structures are different from the DOM tree.

In the case of S-expressions that is true. In the case of HTML it may or may not be true. It depends on how the HTML parser is implemented. There is a "natural" mapping of HTML onto a parse tree that is different from the DOM, but that is not part of the standard (AFAIK).

> Can CL-WHO generate HTML that matches that?

Yes, though native Common Lisp does not provide C-like string escapes, so putting in newlines is a little awkward. You could, of course, bring in a string interpolation library, but here's how you can do it without that:

    ? (defun nl () (who (fmt "~%")))     ; NL = NewLine
    NL
    ? (defun nli () (who (fmt "~%  ")))  ; NLI = NewLine + Indent
    NLI
    ? (princ (html (:pre (nli) (:span "one" (nli)) (nli) (:br (nli) (:span "two") (nl)))))
    
     <pre>
       <span>one
       </span>
       <br>
       <span>two</span>
     </br></pre>
Or you could do this:

    (html (:pre "
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    "))
which looks like cheating but is actually closer to the spirit of the original.

The PRE tag is really weird because it actually changes the way things inside it are parsed. You can actually implement that in Lisp too via reader macros. CL-WHO doesn't support that out of the box, but it's not hard.

I can't imagine anyone actually wanting to do that, though. The PRE tag is for presenting pre-formatted text without changing its appearance, so embedding other tags inside it is kinda perverse. [EDIT: I was wrong about this. See below.]

There are uses for pre with tags embedded.

pre provides the simplified line breaking and usually a monospaced font. However, tags are available to do whatever else.

A major example is that the Vim editor uses pre for formatting syntax colored code to HTML (when you do that with :TOhtml).

The output is a pre block containing various span elements which are styled with CSS.

BTW where in the HTML spec does it say that the interior of pre is parsed differently?

If we are parsing HTML (to Lisp objects or whatever), we should preserve the exact whitespace. The reverse generation should regurgitate the original whitespace.

If we take the license to eliminate newlines, then we ruin pre. The fix is simply not to do that.
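
As a rough check of the "preserve the exact whitespace" idea, here is a sketch using Python's standard `html.parser`. The class and function names are made up for the example. It assumes end tags are written in the plain lowercase `</tag>` form, and it silently drops comments and doctype declarations, so it is not a general-purpose round-tripper.

```python
from html.parser import HTMLParser

class Echo(HTMLParser):
    """Collect the pieces of the input so they can be joined back together."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # raw text, e.g. "<br>"

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # raw text, e.g. "<br />"

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")  # assumes the canonical end-tag form

    def handle_data(self, data):
        self.parts.append(data)  # whitespace and newlines kept verbatim

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def roundtrip(html):
    parser = Echo()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

src = "<pre>\n  <span>one\n  </span>\n  <br>\n  <span>two</span>\n  <br />\n</pre>"
print(roundtrip(src) == src)
```

Nothing here treats pre specially: the newlines survive simply because text nodes are never trimmed or reflowed.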

> where in the HTML spec does it say that the interior of pre is parsed differently?

I was wrong about that. I had a vague memory of putting HTML inside a PRE tag once and having it come out as if it were escaped, but apparently I hallucinated that.

> A major example is that the Vim editor uses pre for formatting syntax colored code to HTML (when you do that with :TOhtml).

OK, I stand corrected on that too.

> If we are parsing HTML (to Lisp objects or whatever), we should preserve the exact whitespace. The reverse generation should regurgitate the original whitespace.

> If we take the license to eliminate newlines, then we ruin pre. The fix is simply not to do that.

Right.

Actually, I just realized that I mis-read the example. I saw <br /> and thought it was </br>. (Maybe the OP edited it?) In any case, the example now reads:

    <pre>
      <span>one
      </span>
      <br>
      <span>two</span>
      <br />
    </pre>
And you can render that in sexpr syntax as:

    (:pre "
      " (:span "one
      ") "
      " (:br) "
      " (:span "two") "
      " (:br) "
    ")
This is a particularly bad example to demonstrate here because the whitespace in the code plays badly with the whitespace in the HN markup. But I tried running this code and it does work. Here is the output copied-and-pasted verbatim from my listener:

    <pre>
      <span>one
      </span>
      <br />
      <span>two</span>
      <br />
    </pre>
Note that both BR tags are rendered as <br />.

It was <br> and <br /> in my example (</br> isn't a valid tag). The point I was getting at was that <br> and the self-closing <br /> are represented differently in a parsed SGML data structure (<tag>, <tag />, and <tag></tag> are all different), though they are equivalent in the HTML DOM tree in the browser.

This is why you would need separate tags to emit them properly with an S-expr syntax: (tag), (tag/), and (tag)(/tag) in my example.
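
The distinction can be observed at the token level with Python's standard `html.parser`, which reports `<br>` and `<br />` through different callbacks. This is only a sketch of the idea, not an SGML parser, and the `TagEvents` and `tag_events` names are invented for the example.

```python
from html.parser import HTMLParser

class TagEvents(HTMLParser):
    """Record which kind of tag token the parser saw, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))      # <tag>

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))   # <tag />

    def handle_endtag(self, tag):
        self.events.append(("end", tag))        # </tag>

def tag_events(html):
    parser = TagEvents()
    parser.feed(html)
    parser.close()
    return parser.events

print(tag_events("<br><br /><span></span>"))
```

A browser's DOM would end up with two identical br elements and an empty span, whereas the event list keeps the three spellings apart, which is the information an emitter needs in order to reproduce them.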