Hacker News
by CharlesW 1090 days ago
> The same thing can also happen in RSS feeds […]: Entity-encoded HTML strings or CDATA HTML strings do not have any guarantee of well-formed-ness.

I wrote a podcast validator, and I don't think that's true — every RSS feed must be "well-formed" XML.

(Note that all "valid" XML documents are "well-formed", but "well-formed" XML documents are not necessarily "valid".)

1 comments

I was talking about the (X)HTML in that RSS feed and its well-formed-ness.

In a perfect world people would construct their XML documents with an API which guarantees that the generated serialisation is a well-formed XML document. For example, the API guarantees that the element tree is properly nested, that namespaces are declared, and that the serialiser escapes all text nodes. Then people could add their well-formed XHTML fragments as a child of <atom:content type="xhtml"> and serialise the whole document, guaranteeing well-formed-ness across namespaces.
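
A minimal sketch of that "perfect world" approach, using Python's standard xml.etree.ElementTree as one example of such an API. The build_entry helper and the sample strings are made up for illustration, and wrapping the fragment in an XHTML <div> is my assumption about how one would attach it; the point is only that the fragment is parsed (and rejected if malformed) before it becomes part of the tree, and that the serialiser handles the escaping.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

# Choose readable prefixes for serialisation; the library declares them itself.
ET.register_namespace("atom", ATOM)
ET.register_namespace("xhtml", XHTML)


def build_entry(title, xhtml_fragment):
    """Hypothetical helper: build an Atom entry with XHTML content."""
    # Parsing raises ET.ParseError if the fragment is not well-formed,
    # so tag soup never makes it into the element tree.
    body = ET.fromstring(f'<div xmlns="{XHTML}">{xhtml_fragment}</div>')

    entry = ET.Element(f"{{{ATOM}}}entry")
    # Text nodes are escaped by the serialiser, not by hand.
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    content = ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "xhtml"})
    content.append(body)
    return ET.tostring(entry, encoding="unicode")


xml_text = build_entry("Tom & Jerry <3", "<p>Hello <b>world</b></p>")
print(xml_text)
```

Passing something like "<p>unclosed" as the fragment fails at parse time instead of producing a feed with broken markup inside it.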

In practice people have a tagsoup string from their data store which they concatenate inside their RSS template in <description>. If you’re lucky, they replace "<" and "&" beforehand or do the CDATA thing. But in XML terms that is just a string, not well-formed markup.
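
To make the "just a string" point concrete, here is a rough sketch of the lucky case, again in Python. The tag_soup value and the is_well_formed helper are invented for illustration; escape comes from the standard xml.sax.saxutils module. The feed parses as XML, but the description comes back as a single text node whose contents are still not well-formed markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical value pulled from a data store: unclosed tags, bare ampersand.
tag_soup = "<p>Unclosed <b>bold & a bare ampersand<br>"

# Template-style concatenation, with "<", ">" and "&" replaced beforehand.
feed = (
    '<rss version="2.0"><channel><item>'
    f"<description>{escape(tag_soup)}</description>"
    "</item></channel></rss>"
)

# The feed itself is well-formed XML, so this parse succeeds.
description = ET.fromstring(feed).find("./channel/item/description")
payload = description.text  # one plain string, no child elements


def is_well_formed(fragment):
    """Hypothetical check: does the string parse as an XML fragment?"""
    try:
        ET.fromstring(f"<div>{fragment}</div>")
        return True
    except ET.ParseError:
        return False


print(len(description), is_well_formed(payload))
```

So a validator that only checks the feed sees nothing wrong; the problem only shows up when a consumer tries to treat the payload as markup.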

Interesting, thank you. Every podcast RSS feed (a tiny subset of RSS feeds) I've seen in the wild is well-formed in the strict XML sense, so the tagsoup problem must be more common on the text syndication side.

I can imagine that this is a result of Apple’s dominant podcast directory. Podcasters submit their feeds to Apple’s Podcast Connect, which I think flags warnings and errors. Other kinds of feeds don’t have as strong an incentive to validate.