It's not, though. It's the easiest thing in the world: Just use a library that never emits unescaped content by default, not even when you make a single-character typo.
The problem is that most of the libraries aren't that.
The back-end should see a 7-byte buffer with values [102 111 111 032 098 097 114], assume it's UTF-8 and convert that to its internal string representation?
No, the backend has no reason to see `foo%20bar`: you escape when you're combining that string with other strings (i.e. into HTML, into a SQL query, etc.)
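To make that concrete, here's a rough Python sketch using only the stdlib. The `render_greeting` helper is made up for illustration. The percent-encoding is undone once at the edge, the backend holds the plain string, and escaping happens only where that string gets spliced into HTML.

```python
import html
from urllib.parse import unquote

# The wire format is percent-encoded; decode it once, at the edge.
raw = "foo%20bar"
value = unquote(raw)  # the backend works with the plain string

# The same bytes as the 7-byte buffer mentioned above.
wire_bytes = bytes([102, 111, 111, 32, 98, 97, 114])
decoded = wire_bytes.decode("utf-8")


def render_greeting(name: str) -> str:
    """Hypothetical helper: escape at the point of combination with HTML."""
    return "<p>Hello, " + html.escape(name) + "</p>"


# Even if the stored value contains markup, it is neutralised only here.
page = render_greeting("<b>" + value + "</b>")
print(page)
```

The stored value never carries HTML or URL escaping around with it, so nothing needs to guess later whether it was "already escaped".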
Many database engines can handle arrays, or table-valued variables which are basically the same thing. Most ORMs will also abstract away arrays for you, so you as the developer never need to deal with escaping of data in arrays.
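If the driver can't bind an array directly, the usual fallback is one placeholder per element, so the values still never touch the SQL text. A small sketch with Python's `sqlite3` (the `users` table and the `find_users` helper are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob'); DROP TABLE users;--"), (3, "carol")],
)


def find_users(conn, names):
    """Look up rows by a list of names without building SQL from the values."""
    if not names:
        return []
    # Only the *number* of placeholders depends on the input, never its content.
    placeholders = ", ".join("?" for _ in names)
    sql = f"SELECT id FROM users WHERE name IN ({placeholders}) ORDER BY id"
    return [row[0] for row in conn.execute(sql, list(names))]


ids = find_users(conn, ["alice", "bob'); DROP TABLE users;--"])
print(ids)
```

An ORM or a driver with real array support would hide the placeholder juggling, but the principle is the same: the list elements travel as parameters.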
> It's the easiest thing in the world: Just use a library that never emits unescaped content by default
That doesn't make any sense? Escaping is a function of the consumer, not the producer. Hell, most of the problematic content doesn't come from a library to start with.
And if your Markdown -> HTML converter produces escaped content... it's not a Markdown -> HTML converter, because the result is not HTML.
More broadly, I think one of the core issues is this:
> Escape user input
User input is a broad and complicated category, and it's easy for user input to be "laundered" as it moves through an application.
And then escaping is an explicit action, which means it can be missed or forgotten, which is also a problem.
This means the solution is really that APIs should escape almost everything by default. Rather than having to mark "untrusted" content, it's trusted content which should be marked as such. Opt-in escaping is the wrong default.
But of course that doesn't solve all the issues. Take Markdown: you want the output of the Markdown converter to be trusted (otherwise it won't be properly formatted on display). What you don't want trusted is the input, and that means you don't want the input to be laundered through the Markdown converter.
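A toy illustration of that default, in Python. The `Trusted` wrapper and the `render` function are hypothetical names for this sketch, not any real library's API. Everything is escaped unless the caller explicitly wraps it.

```python
import html
from string import Template


class Trusted:
    """Hypothetical wrapper: the only way to opt out of escaping."""

    def __init__(self, markup: str):
        self.markup = markup


def render(template: str, **values) -> str:
    """Substitute $name fields, escaping everything not wrapped in Trusted."""
    prepared = {
        key: val.markup if isinstance(val, Trusted) else html.escape(str(val))
        for key, val in values.items()
    }
    return Template(template).substitute(prepared)


out = render(
    "<li>$icon $comment</li>",
    icon=Trusted("<img src='star.png'>"),
    comment="<script>alert(1)</script>",
)
print(out)
```

Forgetting to mark something here produces over-escaped output, which is ugly but visible, rather than an injection, which is silent.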
Which is an issue in most Markdown libraries, as they inherit the "trusted input" model from Gruber's original Markdown, where HTML passthrough was a feature.
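To show the difference between the two trust models, here's a deliberately tiny "Markdown" converter in Python that only understands `**bold**`. Both functions are made up for this sketch, and a real converter needs far more care than escaping up front (link URLs, blockquote syntax, and so on).

```python
import html
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")


def toy_markdown_trusting(text: str) -> str:
    """Original-Markdown-style model: raw HTML in the input passes through."""
    return "<p>" + BOLD.sub(r"<strong>\1</strong>", text) + "</p>"


def toy_markdown_untrusting(text: str) -> str:
    """Escape the input first, then apply the Markdown transformations."""
    return "<p>" + BOLD.sub(r"<strong>\1</strong>", html.escape(text)) + "</p>"


comment = "**hi** <script>alert(1)</script>"
trusting = toy_markdown_trusting(comment)
untrusting = toy_markdown_untrusting(comment)
print(trusting)
print(untrusting)
```

In both cases the output is treated as trusted HTML. The only thing that changes is whether the input was allowed to smuggle markup through the converter.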
In that sense one design I did enjoy is Jinja and Markupsafe in the Python ecosystem:
- Like most modern template libraries, Jinja escapes content by default.
- Also (though somewhat sadly) like most template libraries, Jinja allows marking a value as safe at point of use. That's dangerous, though: content can be mixed, and it's easy for safe content to suddenly be swapped out for user input and become unsafe through seemingly unrelated changes.
- So a better method is to use `markupsafe.Markup` at the source. It's a string subclass which the library considers safe (Jinja uses `markupsafe.escape` internally). The neat thing is that any combination of a Markup instance and a non-Markup string will implicitly escape the non-Markup operand(s).
This means you can mark safe content as safe at the source (where it's easy to prove it's safe because e.g. it's a literal), then most transformations will maintain the safety invariants. Though obviously it only works with content you know will ultimately be markup-injected.
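The mechanism is easy to imitate with the stdlib, which may help show why it works. To be clear, the `SafeHtml` class below is a toy stand-in written for this comment, not MarkupSafe's actual code, and it only covers `+` and `.format()`.

```python
import html


class SafeHtml(str):
    """Toy stand-in for a Markup-like type: a str that is known-safe HTML."""

    @staticmethod
    def escape(value) -> "SafeHtml":
        # Already-safe values pass through; everything else gets escaped.
        if isinstance(value, SafeHtml):
            return value
        return SafeHtml(html.escape(str(value)))

    def __add__(self, other):
        return SafeHtml(str.__add__(self, SafeHtml.escape(other)))

    def __radd__(self, other):
        return SafeHtml(str.__add__(SafeHtml.escape(other), self))

    def format(self, *args, **kwargs):
        args = [SafeHtml.escape(a) for a in args]
        kwargs = {k: SafeHtml.escape(v) for k, v in kwargs.items()}
        return SafeHtml(super().format(*args, **kwargs))


user_input = "<script>alert(1)</script>"

# Literals are marked safe at the source; the plain str gets escaped on contact.
heading = SafeHtml("<h1>") + user_input + SafeHtml("</h1>")
row = SafeHtml("<td>{}</td>").format(user_input)
prefixed = "<i>" + SafeHtml("<b>bold</b>")  # the plain str on the left is escaped

print(heading)
print(row)
print(prefixed)
```

The result of every combination is itself a `SafeHtml`, which is what lets the invariant survive a chain of transformations.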
And non-method APIs can't be overridden (e.g. re, or HTML/XML libraries), so they're not Markup-aware: they'll treat Markup objects as regular strings, which complicates processing pipelines if you want to preserve safety invariants. At the same time, those are laundering opportunities, so care is warranted.
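Here's a quick demonstration of that limitation with `re`. The `SafeHtml` class is again just an empty marker subclass invented for the example, standing in for a Markup-like type.

```python
import re


class SafeHtml(str):
    """Toy marker type standing in for a Markup-like str subclass."""


safe = SafeHtml("<b>hello   world</b>")

# str methods can be overridden by a subclass, but re.sub is a free function:
# it returns a plain str, so the "safe" marker is silently dropped.
collapsed = re.sub(r"\s+", " ", safe)
print(type(collapsed).__name__, collapsed)

# The reverse direction is the laundering risk: re-wrapping the result of a
# non-aware API marks whatever went through it as safe, user input included.
user_input = "<script>alert(1)</script>"
laundered = SafeHtml(re.sub(r"\s+", " ", user_input))
print(isinstance(laundered, SafeHtml), laundered)
```

So after a trip through such an API you either lose the marker (and get double escaping later) or you re-apply it by hand, and the second option is exactly where mistakes creep in.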
> Escaping is a function of the consumer, not the producer
This is incorrect. The producer emits something in a language, be it HTML or JSON or HTTP headers or whatever. Data must be encoded properly for that language. The consumer must then decode, of course, so in a sense it is the job of both. But the onus is really on the producer.
> This is incorrect. The producer emits something in a language, be it HTML or JSON or HTTP headers or whatever. Data must be encoded properly for that language.
Which is the consumption side. When you send data to an HTML template engine, it’s escaped as input, meaning with the template engine as consumer, not with the template engine as producer.
It may be a “pipeline” situation where the consumer also produces something (e.g. JSON or HTML), but it doesn’t have to be: an SQL interface, for example, might produce nothing, yet the data it consumes still needs to be properly escaped.
When your producer produces data, it has no idea how that data will be used, and that’s what determines the necessary transformations, e.g. it’s of no help to you if your templating engine generates content escaped for MSSQL when you’re not going to put it in MSSQL.
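A quick way to see this is to take one string and prepare it for several different consumers. This Python sketch uses only stdlib functions; the four destinations are just examples I picked.

```python
import html
import json
import shlex
from urllib.parse import quote

data = "Tom & Jerry's <cat>"

for_html = html.escape(data)     # destination: HTML text or attribute value
for_url = quote(data, safe="")   # destination: a URL path or query component
for_shell = shlex.quote(data)    # destination: a POSIX shell command line
for_json = json.dumps(data)      # destination: a JSON document

for label, encoded in [
    ("html", for_html),
    ("url", for_url),
    ("shell", for_shell),
    ("json", for_json),
]:
    print(label, encoded)
```

Four destinations, four different encodings of the same value. Whoever created `data` could not have picked the right one in advance.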
> it’s of no help to you if your templating engine generates content escaped for MSSQL when you’re not going to put it in MSSQL.
Allow me to complain a bit about MSSQL.
When you're escaping a LIKE expression for MSSQL, you must also escape the "[" character, since it's a wildcard for MSSQL (and nowhere else except AFAIK Sybase). When you're escaping a LIKE expression for other databases, you must not escape the "[" character, since some databases reject escaping anything other than the % and _ wildcards. That is, your escaping code for a LIKE expression has to be database-specific, because MSSQL (and AFAIK Sybase, it seems both have a common ancestor) decided to be different.
> When you're escaping a LIKE expression for other databases, you must not escape the "[" character, since some databases reject escaping anything other than the % and _ wildcards. That is, your escaping code for a LIKE expression has to be database-specific, because MSSQL (and AFAIK Sybase, it seems both have a common ancestor) decided to be different.
TBF you may need custom codepaths because defaults diverge as well, IIRC postgres and mysql default to ESCAPE '\' while mssql, oracle and sqlite have no default escape character (the latter being the actual spec behaviour).
So in Postgres and MySQL you must always escape backslashes in your LIKE parameter, while in mssql, oracle and sqlite that's not the case.
> TBF you may need custom codepaths because defaults diverge as well, IIRC postgres and mysql default to ESCAPE '\' while mssql, oracle and sqlite have no default escape character (the latter being the actual spec behaviour).
The trick is to just avoid the default, and always use an explicit ESCAPE, which should work the same on every database (except mysql without NO_BACKSLASH_ESCAPES in which you also have to escape the backslash itself, otherwise it will escape the closing quote and get very confused, but that issue can be avoided by using a character other than backslash as the escape character).
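For what it's worth, the "always use an explicit ESCAPE" approach might look something like this in Python. It's a sketch run against SQLite via the stdlib `sqlite3` module. The `escape_like` and `contains` helpers and the `mssql` flag are names I made up, and `!` is just one possible non-backslash escape character.

```python
import sqlite3


def escape_like(term: str, escape_char: str = "!", mssql: bool = False) -> str:
    """Escape LIKE wildcards in `term` for use with an explicit ESCAPE clause.

    `mssql=True` (a hypothetical flag for this sketch) also escapes "[".
    """
    specials = {escape_char, "%", "_"}
    if mssql:
        specials.add("[")
    return "".join(escape_char + ch if ch in specials else ch for ch in term)


conn = sqlite3.connect(":memory:")


def contains(haystack: str, needle: str) -> bool:
    """True if `haystack` contains `needle` literally, checked via LIKE."""
    pattern = "%" + escape_like(needle) + "%"
    row = conn.execute(
        "SELECT ? LIKE ? ESCAPE '!'", (haystack, pattern)
    ).fetchone()
    return bool(row[0])


print(escape_like("50%_off!"))
print(contains("100% sure", "100%"), contains("1000 sure", "100%"))
```

Note that the pattern is still passed as a bound parameter. LIKE escaping and SQL string escaping are two separate layers, and only the first one is handled by `escape_like`.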
The whole point is that the producer may be hostile, or buggy, and the consumer must handle that. Asserting that it “must” be encoded properly does not make it so.
That doesn't make sense to me and I agree with GP. If I consume HTML and I escape all HTML input I'm given, I'm utterly useless.
Now when I consume text and convert that text into HTML for further treatment, I'm producing HTML, and I must properly escape my input in that conversion. The escaping is only needed because I produce HTML. In fact the only time escaping can be done is when producing data, because if unescaped data is ever produced, the cat's out of the bag.
Edit: Actually, I think producer/consumer is the wrong way to talk about this. Escaping only ever occurs at a boundary when transforming between formats (eg from "text string" to "html string") which is always both producer (of the new format) and consumer (of the old format). But it can always be thought of as a type cast, with possible type confusions when input and output formats share the same machine representation (eg string).
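The type-cast framing can be sketched with Python's `typing.NewType`. The `Text` and `Html` types and both functions are names invented for this example, and the checking is static only (a type checker such as mypy would flag misuse, while at runtime both are plain `str`).

```python
import html
from typing import NewType

Text = NewType("Text", str)  # plain text, not yet safe to splice into markup
Html = NewType("Html", str)  # a fragment that is already valid HTML


def text_to_html(value: Text) -> Html:
    """The 'cast' between formats: the one place escaping happens."""
    return Html(html.escape(value))


def paragraph(body: Html) -> Html:
    """Only accepts Html; a type checker would flag passing Text here."""
    return Html("<p>" + body + "</p>")


comment = Text("1 < 2 & 3 > 2")
page = paragraph(text_to_html(comment))
print(page)
```

Because both types share the same machine representation, the only thing preventing the confusion is the annotation, which is pretty much the point being made above.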
> That doesn't make sense to me and I agree with GP. If I consume HTML and I escape all HTML input I'm given, I'm utterly useless. [...] Now when I consume text and convert that text into HTML for further treatment, I'm producing HTML, and I must properly escape my input in that conversion.
Which is my point, it's the consumption side which defines what the escaping should be.
> Escaping only ever occurs at a boundary when transforming between formats (eg from "text string" to "html string") which is always both producer (of the new format) and consumer (of the old format).
A database interface is not a transformer / producer, yet it still needs escaping. Globbing is not a transformer either, and it still needs escaping.
The thing that accepts the input must make sure it is properly escaped. Think of SQL injection attacks: they happen because the thing that accepts input hasn't properly escaped the input.
Cross-site scripting attacks are exactly the same thing, but occur when the input side doesn't properly escape HTML input.
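The globbing case is easy to show with the Python stdlib. The filename is an arbitrary example; `fnmatch.fnmatchcase` and `glob.escape` are the real stdlib functions.

```python
import fnmatch
import glob

filename = "report[1].txt"  # e.g. a user-supplied name containing "["

# Used directly as a pattern, "[1]" is a character class matching just "1".
unescaped_matches_itself = fnmatch.fnmatchcase(filename, filename)
unescaped_matches_other = fnmatch.fnmatchcase("report1.txt", filename)

# glob.escape wraps the special characters so they match literally.
pattern = glob.escape(filename)
escaped_matches_itself = fnmatch.fnmatchcase(filename, pattern)
escaped_matches_other = fnmatch.fnmatchcase("report1.txt", pattern)

print(pattern)
print(unescaped_matches_itself, unescaped_matches_other)
print(escaped_matches_itself, escaped_matches_other)
```

Nothing is being produced here, there's just a matcher consuming a pattern, and the unescaped name still quietly matches the wrong file.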