by ttepasse 1085 days ago
> I implemented an RSS 2.0 parser faster than I could understand the Atom specification. Atom can do things like embed HTML inline in the XML instead of as a CDATA string. In theory this sounds great, but it ends up in a big mess of complexity (e.g. a blog post with handwritten invalid HTML).

The same thing can also happen in RSS feeds (and JSON Feeds): Entity-encoded HTML strings or CDATA HTML strings do not have any guarantee of well-formed-ness. The direct embedding of XHTML into Atom as namespaced elements just surfaces potential invalid markup higher up.

> The same thing can also happen in RSS feeds […]: Entity-encoded HTML strings or CDATA HTML strings do not have any guarantee of well-formed-ness.

I wrote a podcast validator, and I don't think that's true — every RSS feed must be "well-formed" XML.

(Note that all "valid" XML documents are "well-formed", but "well-formed" XML documents are not necessarily "valid".)

I was talking about the (X)HTML in that RSS feed and its well-formed-ness.

In a perfect world people would construct their XML documents with an API which guarantees that the generated serialisation is a well-formed XML document. E.g. the API guarantees that the element tree is nested, that namespaces are declared and that the serialiser escapes any text nodes. Then people could add their well-formed XHTML fragments as a child to <atom:content type="xhtml"> and then serialise the whole document, guaranteeing well-formed-ness across namespaces.
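
To make the "perfect world" case concrete, here is a minimal sketch using Python's standard-library ElementTree. The helper name `build_entry` and the sample strings are hypothetical, and this is only one way such an API could be used, not how any particular feed generator works. The XHTML fragment is parsed first, so a broken fragment fails before it ever reaches the feed, and the serialiser handles escaping and namespace declarations.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

# Pick readable prefixes for serialisation (otherwise ns0, ns1, ... are used).
ET.register_namespace("atom", ATOM)
ET.register_namespace("xhtml", XHTML)


def build_entry(title: str, xhtml_fragment: str) -> str:
    """Hypothetical helper: build an Atom entry with inline XHTML content."""
    # Parse the fragment inside the <div> wrapper that Atom requires for
    # type="xhtml". ET.ParseError is raised here if it is not well-formed.
    div = ET.fromstring(f'<div xmlns="{XHTML}">{xhtml_fragment}</div>')

    entry = ET.Element(f"{{{ATOM}}}entry")
    # Text nodes are assigned as plain strings; the serialiser escapes them.
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    content = ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "xhtml"})
    content.append(div)

    # Serialising the tree emits the namespace declarations on the root.
    return ET.tostring(entry, encoding="unicode")


serialised = build_entry("Fish & chips", "<p>a &lt; b</p>")
```

With this approach a fragment like `<p>unclosed` never makes it into the output: `build_entry` raises `ET.ParseError` instead of producing a broken feed.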

In practice people have a tagsoup string from their data store which they concatenate inside their RSS template in <description>. If you’re lucky, they replace "<" and "&" beforehand or do the CDATA thing. But in XML terms that is just a string, not well-formed markup.
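
A small Python sketch of that "in practice" case, with a made-up tagsoup string and hypothetical template functions standing in for whatever a real CMS does. Both the escaped and the CDATA variant produce a well-formed RSS item, but what comes back out of `<description>` is the same unparseable string that went in.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up example of what might sit in a data store.
tagsoup = "<p>Unclosed paragraph <b>bold<br> & a bare ampersand"


def render_escaped(html: str) -> str:
    # Hypothetical template: replace "&", "<" and ">" and concatenate.
    return f"<item><description>{escape(html)}</description></item>"


def render_cdata(html: str) -> str:
    # Hypothetical template: "the CDATA thing" (breaks if html contains "]]>").
    return f"<item><description><![CDATA[{html}]]></description></item>"


def is_well_formed(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


escaped_item = render_escaped(tagsoup)
cdata_item = render_cdata(tagsoup)

# The feed-level XML is fine in both cases ...
feed_ok = is_well_formed(escaped_item) and is_well_formed(cdata_item)

# ... but the payload is just text, and that text is not well-formed markup.
payload = ET.fromstring(escaped_item).findtext("description")
payload_ok = is_well_formed(f"<div>{payload}</div>")
```

Here `feed_ok` is true while `payload_ok` is false: an XML parser has nothing to complain about, because from its point of view the description contains no markup at all.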

Interesting, thank you. Every podcast RSS feed (a tiny subset of RSS feeds) I've seen in the wild is well-formed in the strict XML sense, so the tagsoup problem must be more endemic on the text syndication side.

I can imagine that this is partly a result of Apple’s dominant podcast directory. Podcasters submit their feeds to Apple’s Podcast Connect, which I think flags warnings and errors. Other kinds of feeds don’t have that strong an incentive to validate.