| > It's the easiest thing in the world: Just use a library that never emits unescaped content by default

That doesn't make any sense. Escaping is a function of the consumer, not the producer. Hell, most of the problematic content doesn't come from a library to start with. And if your Markdown -> HTML converter produces escaped content... it's not a Markdown -> HTML converter, because the result is not HTML.

More broadly, I think one of the core issues is this:

> Escape user input

User input is a broad and complicated category, and it's easy for user input to be "laundered" as it moves through an application. Escaping is also an explicit action, which means it can be missed or forgotten.

This means the solution is really that APIs should escape almost everything by default. Rather than having to mark "untrusted" content, it's trusted content which should be marked as such. Explicit, opt-in escaping is the wrong default.

But of course that doesn't solve all the issues. Take Markdown: you want the output of the converter to be trusted (otherwise it won't be properly formatted on display). What you don't want trusted is the input, which means you don't want the input to be laundered through the converter. That is an issue in most Markdown libraries, as they inherit the "trusted input" model from Gruber's original Markdown, where HTML passthrough was a feature.

In that sense one design I did enjoy is Jinja and MarkupSafe in the Python ecosystem:

- Like most modern template libraries, Jinja can escape content by default (autoescaping, which e.g. Flask enables for HTML templates).
- Also (though somewhat sadly) like most template libraries, Jinja allows marking a value as safe at point-of-use. That's dangerous, as content can be mixed, and it's easy for safe content to be swapped out for user input and become unsafe through seemingly unrelated changes.
- So a better method is to use `markupsafe.Markup` at the source. It's a string subclass which the library considers safe (Jinja uses `markupsafe.escape` internally), and the neat thing is that any combination of a Markup instance and a non-Markup string implicitly escapes the non-Markup parameter(s).

This means you can mark safe content as safe at the source (where it's easy to prove it's safe because e.g. it's a literal), and most transformations will then maintain the safety invariants.

Obviously it only works with content you know will ultimately be markup-injected. And non-method APIs can't be overridden (e.g. `re`, or HTML/XML libraries), so they're not Markup-aware: they'll treat Markup objects as regular strings, which complicates processing pipelines if you want to preserve the safety invariants. At the same time, those are laundering opportunities, so care is warranted. |
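To make the "mark trusted content at the source" idea concrete, here is a rough stdlib-only sketch of the mechanism. `SafeMarkup` is a hypothetical toy class written for this illustration, not MarkupSafe's actual implementation, and it only covers `+` and `join`; the real `markupsafe.Markup` handles many more string operations.

```python
import re
from html import escape


class SafeMarkup(str):
    """Toy stand-in (hypothetical) for a 'trusted HTML' string subclass."""

    def _coerce(self, other):
        # Trusted values pass through untouched; anything else is escaped.
        if isinstance(other, SafeMarkup):
            return other
        return SafeMarkup(escape(str(other)))

    def __add__(self, other):
        return SafeMarkup(str.__add__(self, self._coerce(other)))

    def __radd__(self, other):
        # Called for `plain_str + SafeMarkup(...)`.
        return SafeMarkup(str.__add__(self._coerce(other), self))

    def join(self, parts):
        return SafeMarkup(str.join(self, (self._coerce(p) for p in parts)))


user_input = "<script>alert(1)</script>"

# The literals are marked trusted at the source; the user input is not,
# so it gets escaped implicitly when combined with them.
page = SafeMarkup("<p>") + user_input + SafeMarkup("</p>")

# A non-method API such as re.sub knows nothing about the subclass:
# the result is a plain str and the trust marker is gone.
laundered = re.sub(r"p>", "div>", page)
```

Here `page` ends up as `<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>` and is still a `SafeMarkup`, while `laundered` is a plain `str`. Whoever re-marks that value as safe has to re-prove its safety by hand, which is exactly the laundering opportunity described above.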
« Escaping is a function of the consumer, not the producer »
This is incorrect. The producer emits something in a language, be it HTML or JSON or HTTP headers or whatever. Data must be encoded properly for that language. The consumer must then decode, of course, so in a sense it is the job of both. But the onus is really on the producer.
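As a small illustration of that point (a sketch using only the Python standard library; the sample value and variable names are made up for the example), the same piece of data needs a different encoding for each output language, and the consumer's decoder then recovers the original:

```python
import json
from html import escape, unescape
from urllib.parse import quote, unquote

value = 'Tom & "Jerry" <3'

# Producer side: encode for the language being emitted.
html_fragment = f"<p>{escape(value)}</p>"   # HTML text content
json_document = json.dumps({"name": value})  # JSON string
query = "q=" + quote(value, safe="")         # URL query component

# Consumer side: decoding each form gives back the same value.
from_html = unescape(html_fragment[3:-4])
from_json = json.loads(json_document)["name"]
from_query = unquote(query[2:])
```

The HTML form comes out as `<p>Tom &amp; &quot;Jerry&quot; &lt;3</p>` and the query as `q=Tom%20%26%20%22Jerry%22%20%3C3`; neither encoding would be valid in the other context.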