An alternate view: “string” is not a granular enough type, just like “bitfield” is not a type. Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is. But let’s assume that you’ve been a diligent programmer and filtered all that at the edges, and now have a sequence of Unicode code points (or possibly graphemes). You still need to know the escaped-ness of the string! This is also a form of typing. Perl was early with its concept of “tainted” strings, but modern languages can use types to mark this concept in the code. At all points in your code, you should be sure what type the value you have is. If you need to use the types in your language to ensure this, then use types. But make sure of it somehow.
> Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is.
This is a language defect. If your language was invented in the 1960s it's an understandable defect, but it's still a defect. I do not want to write computer software with strings in a language that doesn't even have an actual string type rather than "Eh, maybe this is a string or maybe it's just some random bytes, who cares".
Only in very low level software should it make a difference whether the string is in fact represented as UTF-8 or UTF-16 or whatever, but Rust shows that you can write software at a low level and still enforce type safety for strings.
I agree though that here once again the Right Thing™ is a strong type system. If I've got a Microsoft Graph username, a URL, an email address and a UUID, that's four types, those are not four strings with human names to distinguish them. We don't need to escape some or any of these types - in their context.
A type system isn't going to save you from users submitting all kinds of potentially different encodings. Which also depends on what kind of user input is being handled: Is it OS-provided UI? Is it something being sent to a service accessible on the internet? Is it from a CLI? Is it from a file? Context matters for the potential space of what kind of data you might be operating on, which could require different ways of either knowing what kind of data you have based on having more control over the input versus having to detect stuff (or be told, correctly) from highly arbitrary things like reading from a file. All of that is external to the type system, and requires doing something before you can tag it with the correct type. Some languages might attempt to detect this stuff for you, but that could potentially be considered a language defect if it's hard to detect what a string is without having other input telling you what that string contains, such as a header in an HTTP request saying that it's UTF-8.
"A type system isn't going to save you from users submitting all kinds of potentially different encodings."
Yes, it is, because you give that a type that indicates you don't know what the encoding is, like RawInput or something. You then can not pass this type to any other function that doesn't explicitly call for that type. If you have some function that accepts it, blindly casts it to UTF-8, and slams it out into a file, well, that's not the type system's fault [1].
Of course a type system won't prevent you from still just being wrong or writing bugs; nobody promises that, not even the formal methods advocates. But it will prevent you from just accidentally blindly shoveling it out somewhere it doesn't belong without ever examining it or thinking about it.
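For what it's worth, here is a minimal sketch of the `RawInput` idea in Python. The names `RawInput`, `Text`, `decode_utf8` and `store_comment` are made up for illustration, and since Python does not check annotations at run time, the `isinstance` guard stands in for what a static checker (or a statically typed language) would reject before the program runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawInput:
    """Bytes of unknown encoding, straight from the outside world."""
    data: bytes


@dataclass(frozen=True)
class Text:
    """Text that has been decoded and checked."""
    value: str


def decode_utf8(raw: RawInput) -> Text:
    # The only path from RawInput to Text; invalid bytes raise UnicodeDecodeError.
    return Text(raw.data.decode("utf-8"))


def store_comment(comment: Text) -> str:
    # Runtime guard standing in for what a static type checker would reject.
    if not isinstance(comment, Text):
        raise TypeError("store_comment needs Text, not " + type(comment).__name__)
    return comment.value
```

The point is only that the guess about the encoding has to happen in one visible place (`decode_utf8`), not that this particular wrapper is the right design.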
I think you may be believing in a popular myth about strong typing systems, that they are designed to somehow prevent bad data from coming into your system at all. You correctly identify that as impossible. But what strong typing systems can do is force you to deal with the fact that bad data may be coming in. On the outside, you have the chaos of, say, a bag of bytes that may or may not be JSON. On the inside, you have a "type SomeStruct { int a; int b }". A strong type system forces you to write some sort of adapting code between those two, and guarantees that the result of that adapting code will be only and exactly the type that comes out of that adapting code, no "whoops, sometimes this dynamic code just returns a string, or maybe a network socket, or who knows what". Nothing can prevent your HTTP API from receiving a JPG of an anime character instead of JSON specifying a user to delete, but a strong type system can make you deal with that immediately and fully, instead of garbage data of indeterminate type floating through the system for an indeterminate period of time.
[1]: Also note there are a lot of "strong type systems" in the world that still fail to take advantage of their own capabilities and let bare string types and such float around too much. There are reasons why libraries must support the lowest common denominator; a file is a series of bytes with no further constraints, so the lowest level API has no choice but to accept that, but higher level APIs should more often take more restricted types. That strong type systems can save you from this doesn't mean they all do. I have a number of wrapper types in various languages just to add these guarantees to my programs not provided by the underlying libraries, though I also have some code that just wraps the underlying libraries that can't help but correctly take raw bytes at the lowest level.
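The "adapting code" between the bag of bytes and `SomeStruct` can be sketched like this in Python. `parse_some_struct` is a hypothetical name, and the exact validation rules (exactly two integer fields, nothing else) are an assumption made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SomeStruct:
    a: int
    b: int


def parse_some_struct(payload: bytes) -> SomeStruct:
    """Turn untrusted bytes into SomeStruct or raise ValueError; nothing in between."""
    try:
        obj = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("payload is not UTF-8 JSON") from exc
    if not isinstance(obj, dict) or set(obj) != {"a", "b"}:
        raise ValueError("expected an object with exactly the keys a and b")
    a, b = obj["a"], obj["b"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if type(a) is not int or type(b) is not int:
        raise ValueError("a and b must be integers")
    return SomeStruct(a, b)
```

A JPG sent to this function fails at the decode step and never gets further into the program.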
>If you have some function that accepts it, blindly casts it to UTF-8
Unfortunately, if you interact with services you didn't write, you're usually back to getting "strings" of unknown encoding, and typically requirements that force some blind or semi-blind guessing.
Blind guessing is not related to the type system. Nobody has claimed type systems can solve that. What they can do is force you to guess, and make it clear where that is occurring.
This, again, goes back to a very broken understanding of type systems that I often see, and once held myself. The claim of type systems is not that they magically go out into the world and fix the external world to be well-typed; the claim is that they force your code to deal with the conversion of the external world into a clean internal representation, and presumably, to have a clean error pathway when that fails. Dynamically-typed code will let you float along much more easily. Statically-typed code can still be written that way, but at least then it's poor statically-typed code. In some circles that sort of broken dynamic code is essentially idiomatic. (Though that is fading away as every year more programmers learn how bad an idea that is.)
If the language explicitly says how strings are defined, libraries that go "Eh, I'll just shove nonsense bytes in this data structure and claim that's a string" are broken by definition.
That's just as true in Java as in Rust. The problem is languages like C++ or D which just don't care and have a "string" type that might just be some bytes.
The way Rust does it is IMO interesting. There is e.g. an OsStr for strings that describe things like filenames in a directory listing, because these could actually be invalid UTF-8 but your program might still need to be able to handle them.
So when you wanna convert that OsStr to a String you are forced to handle this in one way or another. This is less comfortable, but describes the underlying systems more accurately.
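This is not Rust's `OsStr`, but a rough Python analogue of the same trade-off: raw filenames are `bytes`, and each way of turning them into `str` makes the handling of invalid UTF-8 explicit. The three function names are hypothetical, and UTF-8 as the filename encoding is an assumption.

```python
def filename_to_text(raw_name: bytes) -> str:
    """Strict conversion: raises UnicodeDecodeError on invalid UTF-8."""
    return raw_name.decode("utf-8")


def filename_for_display(raw_name: bytes) -> str:
    """Lossy conversion for showing to a user; bad bytes become U+FFFD."""
    return raw_name.decode("utf-8", errors="replace")


def filename_round_trip(raw_name: bytes) -> bytes:
    """Lossless detour through str using the surrogateescape error handler."""
    as_text = raw_name.decode("utf-8", errors="surrogateescape")
    return as_text.encode("utf-8", errors="surrogateescape")
```

Either way the caller has to pick one of these on purpose, which is the less comfortable but more accurate part.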
There is no such thing as an "escaped string". Escaping is not a general concept, it is something that differs given the intended destination of that string. For example, "I am a %SYSTEM% person" is a perfectly fine escaped bash string, but an unsafe CMD string; it is also fine as a C# format string, but potentially unsafe as a C format string, depending on your actual implementation of printf; it is an escaped MSSQL filter string, but not an escaped PostgreSQL filter string.
Also, not all strings/texts should be thought of as Unicode code points/graphemes.
The biggest problem appears once you start wanting to combine such strings.
Say a user inputs some raw text in a form that is intended to be the title of a button in a HTML form that will be sent in a JSON file to be stored in a SQL db. The expectation is that you can later retrieve this HTML snippet from the DB and display it on the screen.
You have to first escape the raw text from the user so that it can be safely used in HTML - so you will go from user_input_string to html_pcdata_escaped_user_input_string. Then you compose a bit of HTML that contains the button and this part; let's say you store it in some HTML DOM object. Then you want to send this HTML object as JSON, so you have to know to convert html_pcdata_escaped_user_input_string into json_string_escaped_user_input_string - but that loses type information which may hurt us later, so maybe we want to actually store it as json_string_escaped_html_pcdata_escaped_user_input_string. Then, if we want to use this as part of an SQL query string, by the same considerations, we want to put it in a mysql_like_filter_escaped_json_string_escaped_html_pcdata_escaped_user_input_string - which is getting really ugly, and easy to mess up.
Of course, the order of escaping matters, so a mysql_like_filter_escaped_json_string_escaped_html_pcdata_escaped_user_input_string and a json_string_escaped_mysql_like_filter_escaped_html_pcdata_escaped_user_input_string are different things that need to be decoded differently (of course, for SQL in particular we could use prepared queries instead).
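A small Python sketch of why the order matters, using only the HTML and JSON layers from the standard library. The sample input is made up.

```python
import html
import json

user_input = 'Tom & "Jerry" <3'

# Same two escapes, applied in opposite orders.
html_then_json = json.dumps(html.escape(user_input))
json_then_html = html.escape(json.dumps(user_input))

# The two results are different strings...
assert html_then_json != json_then_html

# ...and each one has to be decoded by peeling the layers off in reverse order.
assert html.unescape(json.loads(html_then_json)) == user_input
assert json.loads(html.unescape(json_then_html)) == user_input
```

Decoding in the wrong order does not recover the input: `json.loads(json_then_html)` fails outright, because the text no longer starts with a JSON quote.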
Also, we can't ever concatenate this with any other string-like type until perhaps the final use point (such as sending a query string to the DB), since we need to remember which part of the string is escaped in which way, and for what types of uses it is safe (an HTML-escaped string may still contain SQLi or JSON injection).
The point is that even with proper types, this is not easy to manage or fix.
It also requires quite advanced type systems to be able to use these in normal contexts - say, you want to store several such strings with different provenances in a Map or Set or even List, without "forgetting" the provenance.
As far as I'm concerned, "the point is" that what we're describing here -- keeping track of the type of each piece of data -- is the best way to think about this class of problems (as opposed to talking about "sanitising" or "safe data", for example).
It's then up to us to decide how to best make use of the type system of whatever language we end up implementing it in (or, indeed, to treat the ability to deal with this well as a requirement when we're choosing a language).
For me, effects like "we can't ever concatenate this with any other string-like type" are desirable features, not problems with this approach: either it's possible to convert both strings to a common form, or I shouldn't be trying to combine them.
Sure - I'm just saying that the article is right that this problem is difficult, not easy, and that it doesn't get significantly easier if we accurately keep track.
> The point is that even with proper types, this is not easy to manage or fix.
In practice, in a typed language, nothing like this ever occurs, because the rule is just: "use string for everything, except the edge".
You're thinking of a type like: HtmlString<JsonString<Utf8String>>
In practice the type that is "passed around" is almost always just "string", and this is converted at the last moment to a single destination format, such as HtmlString.
When writing to databases, there isn't even an escape step at all, because you use parametrised queries, right? Right!?
The database stores "string", not "DatabaseEscapedString".
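A minimal sketch of that with Python's built-in `sqlite3` module. The table and the hostile-looking value are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

hostile = "Robert'); DROP TABLE users;--"

# The value travels separately from the SQL text, so there is no escaping step.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))

# What comes back is the plain string that went in.
stored = conn.execute("SELECT name FROM users").fetchone()[0]
```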
This is similar to how instants in time ought to be handled. You store them as UTC and convert to the user's time zone at the last moment. You don't pass around some monstrosity that somehow keeps track of +10-5+3 in order to arrive at +7. That would be absurd. Instead you pass around the "Z" UTC timestamp and add +7 when needed.
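The timestamp half of the analogy, sketched in Python. `for_display` is a hypothetical helper, and a fixed hour offset is a simplification: real code would more likely use a named zone from `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone

# Stored and passed around in UTC.
event_utc = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def for_display(instant: datetime, utc_offset_hours: int) -> str:
    """Convert at the edge, just before showing the value to a user."""
    local = instant.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.isoformat()


print(for_display(event_utc, 7))  # 2024-01-01T19:00:00+07:00
```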
> In practice the type that is "passed around" is almost always just "string"
That's what happens in practice, of course. The GP was proposing something else, and I was explaining how complicated that gets.
> and this is converted at the last moment to a single destination format, such as HtmlString.
I explained before why this doesn't work unless we're talking about the final destination of this string. Otherwise, if that string is being taken through various encodings (say user input to JSON to sprintf format string to HTTP body), and if you need to combine safe and unsafe input, then what you're saying doesn't work anymore.
Here is a sketch of an example:
userInput := read()
jsonFormat := "{\"context\": \"%s\", \"input\": \"" + json.escape(userInput) + "\"}" //easy and safe
finalJson := ""
sprintf(finalJson, jsonFormat, "some context")
// oops - unsafe if original input was "%s"
sprintf(finalJson, printf.escape(jsonFormat), "some context")
// oops - does the wrong thing - it will output "{\"context\": \"%s\", \"input\": \"%s\"}"
//let's try the other way around?
userInput := read()
formatStr := "{\"context\": \"%s\", \"input\": \"" + printf.escape(userInput) + "\"}" //easy and safe
finalJson := ""
sprintf(finalJson, formatStr, "some context")
// oops - unsafe if userInput was "safe-looking\", \"bypassAuth\": \"true\"}"
sprintf(finalJson, json.escape(formatStr), "some context")
// oops - does the wrong thing - it will output
// "\"{\\\"context\\\":\\\"some context\\\", \\\"input\\\": \\\"safe-looking\\\", \\\"bypassAuth\\\": \\\"true\\\"}\"" - that is, a JSON string instead of a JSON object
The only solution to get this to work is to keep the user input string entirely separate from any other string, and apply escaping to it individually at every level where it is used.
Additionally, you will need to remember what escaping has been applied to it, and in what order, so that it can be un-escaped back to the original value when needed.
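To make the "escape the user input individually at every level" point concrete, here is a hedged Python translation of the sketch, using `%`-formatting in place of sprintf. `naive_render`, `build_format` and `render` are hypothetical names. In Python the naive version fails with an exception rather than reading stray arguments as C's sprintf might, but the layering problem is the same.

```python
import json


def naive_render(user_input: str, context: str) -> str:
    # JSON-escaped only: any % in the user input is still live in the format string.
    fmt = '{"context": "%s", "input": ' + json.dumps(user_input) + '}'
    return fmt % (context,)


def build_format(user_input: str) -> str:
    # Layer 1: JSON-escape the user input on its own (json.dumps adds the quotes).
    as_json = json.dumps(user_input)
    # Layer 2: printf-escape that same piece on its own, before it joins the format.
    as_format = as_json.replace("%", "%%")
    return '{"context": "%s", "input": ' + as_format + '}'


def render(user_input: str, context: str) -> str:
    # The context value is JSON-escaped too; [1:-1] drops json.dumps's outer quotes.
    return build_format(user_input) % (json.dumps(context)[1:-1],)
```

With this, `render("%s", "some context")` parses back to an object whose `input` is the literal text `%s`, while `naive_render("%s", "some context")` raises `TypeError` because the format string now expects two arguments.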
Escape strings at the last possible moment, and ideally it's done by whatever library you're using so you never have to worry about it. It's never not been clear to me in our codebases if I'm dealing with a raw string or a safe one. They're all unsafe, because you have no clue what context they're going to be used in.
If you're writing a web framework or a DB library things might be different though - in that case a different class probably makes sense. If you have a module for a certain communication medium, then yeah you might use it in that module. But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.
Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.
> They're all unsafe, because you have no clue what context they're going to be used in.
That's correct, but it's the reverse thinking from the escaping one.
Because in the escaping one, when you need not to escape you will also not-escape at the last possible moment, and that's a sure-fire way to launder attacker-controlled data.
Instead you should escape everything, and opt-out as early as possible.
> But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.
That's why you do the reverse: most strings are unsafe to everything, but the strings which are safe are generally safe to one specific subsystem. So you say that.
> Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.
It sounds like hell because it makes no sense, there's no such thing as a TCPString because TCP is not string-based and TCP messages are not composed that way.
> Instead you should escape everything, and opt-out as early as possible.
That’s not even remotely workable for any system with more than one kind of “escaping”. What if I want to use a string as:
1. An IDNA-encoded domain name
2. An HTML text snippet
3. A shell command string argument
4. A string literal part of a regular expression
5. A part to be used in an XML CDATA section
6. A JSON string
I can’t escape the string beforehand, since the escaping rules are all different. No, the only sensible alternative is to use the same rule which we all use for character encoding: Encode and decode (and escape) at the edges.
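A quick Python sketch of four of those destinations, using only standard-library functions. The `for_*` wrapper names are hypothetical, `shlex.quote` targets POSIX shells, and IDNA and CDATA are left out to keep it short.

```python
import html
import json
import re
import shlex


def for_html(text: str) -> str:
    return html.escape(text)


def for_json(text: str) -> str:
    return json.dumps(text)


def for_shell_argument(text: str) -> str:
    return shlex.quote(text)


def for_regex_literal(text: str) -> str:
    return re.escape(text)


value = "it's 1 < 2 (really)"

# One unescaped value in the program, four different encodings at four edges.
encodings = {f.__name__: f(value)
             for f in (for_html, for_json, for_shell_argument, for_regex_literal)}
```

All four results differ from each other, which is why there is no single "escaped" form to compute up front.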
> It’s not an issue, because by default nothing is safe anywhere, so all those APIs should treat the injected data thus.
No library does this, since it does not know what strings I send it with their literal meaning intended, and which strings I send it with their escape characters intended to be interpreted. The escape characters are part of the API of that library. The library does not accept “strings” as such, it accepts “escaped” strings. And since my program deals with normal unescaped strings, I have to escape the strings before I send them to the API.
> There is no escaping, because everything is automatically internally escaped by default.
I have a feeling that you have a different meaning of the word “escaped” than me.
In my future perfect language, char seqs cannot be cast. They must be converted. Basically syntactic sugar for Java-style char encoding infrastructure.
I have assumed that disallowing casting was sufficient. But now I'll have to ponder "taint" too. From the hip, I really like the notion of tracking the provenance of data, a la defensive programming.
No. The way to solve this is to recognize where the problem lies. The problem does not lie with storing user input. The problem lies with improperly putting strings in other data.
So all you need to do, is to do that properly. Either you commit to using constructs like parameterized queries instead of concatenating strings, and use the DOM to put together HTML the way you want, or you escape as you concatenate the strings.
Don’t store escaped strings, it’s a recipe for disaster.
It really isn't. Proof: Most people who try, largely succeed. Those who do something silly like try to do it 100% manually generally rapidly realize that's not a good plan, and usually there is a not-very-hard way to encapsulate it somehow, since that's pretty much what our languages do, encapsulate things.
I'm not saying it's completely trivial or that there's never an issue here or there. What I'm saying is, it's on par with any of dozens of other issues in programming. Bugs happen, errors happen, but no more so than anywhere else. A series of systems with slightly different encoding practices can also cause some headaches, but, again, these are on par with a number of other issues that can emerge in such systems, not especially bad. I've seen a lot of crappy code that gets this wrong at scale, written by programmers who don't really know or care what they're doing, but the same code was crap in a dozen other ways too, and generally screwed up even easier things as well.
Where you get the problems are, from largest to smallest, 1. People who don't realize it's an issue at all and concatenate everything and 2. People who have just been taught about it, and are doing a wrong thing, most often trying to filter on the way "in" instead of the way "out". ("Sanitize user input" delenda est. Stop saying it. It's wrong.) Which is also not an exceptional case, because again there are any number of things that have the exact same characteristics in the programming world.
I would expect "ridonkulously hard" to encompass something that even when tried is super hard and often a failure, and this isn't that case.
It's not, though. It's the easiest thing in the world: Just use a library that never emits unescaped content by default, or when you make a single-character typo.
The problem is that most of the libraries aren't that.
The back-end should see a 7-byte buffer with values [102 111 111 032 098 097 114], assume it's UTF-8 and convert that to its internal string representation?
no, the backend has no reason to see `foo%20bar` - you escape when you're combining that string with other strings (ie into HTML, into a SQL query, etc.)
Many database engines can handle arrays, or table-valued variables which are basically the same thing. Most ORMs will also abstract away arrays for you, so you as the developer never need to deal with escaping of data in arrays.
> It's the easiest thing in the world: Just use a library that never emits unescaped content by default
That doesn't make any sense? Escaping is a function of the consumer, not the producer. Hell, most of the problematic content doesn't come from a library to start with.
And if your Markdown -> HTML converter produces escaped content... it's not a Markdown -> HTML converter, because the result is not HTML.
More broadly, I think one of the core issues is this:
> Escape user input
User input is a broad and complicated category, and it's easy for user input to be "laundered" as it moves through an application.
And then escaping is an explicit action, which means it can be missed or forgotten, which is also a problem.
This means the solution is really that APIs should default to escaping most everything. Rather than having to mark "untrusted" content, it's trusted content which should be marked thus. "Escaping" is the wrong default.
But of course that doesn't solve all the issues. Like markdown, where you want the output of the Markdown converter to be trusted (otherwise the output won't be properly formatted on display), what you don't want trusted is the input, and that means you don't want the input to be laundered through the Markdown converter.
Which is an issue in most Markdown libraries, as they inherit the "trusted input" model from Gruber's original Markdown, where HTML passthrough was a feature.
In that sense one design I did enjoy is Jinja and Markupsafe in the Python ecosystem:
- Like most modern template libraries, Jinja escapes content by default.
- Also (though somewhat sadly) like most template libraries Jinja allows marking a value as safe at point-of-use, however that's dangerous as content can be mixed and it's easy for safe content to suddenly be swapped out for user input and become unsafe through seemingly unrelated changes.
- So a better method is to use `markupsafe.Markup` at the source, it's a string subclass which the library considers safe (because Jinja uses `markupsafe.escape` internally), the neat thing is any combination between a Markup instance and a non-Markup string will implicitly escape the non-Markup parameter(s).
This means you can mark safe content as safe at the source (where it's easy to prove it's safe because e.g. it's a literal), then most transformations will maintain the safety invariants. Though obviously it only works with content you know will ultimately be markup-injected.
And non-method APIs can't be overridden (e.g. re, or HTML/XML libraries) so they're not Markup-aware: they'll treat Markup objects as regular strings, which complicates processing pipelines if you want to conserve safety invariants. At the same time, those are laundering opportunities so care is useful.
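To illustrate the mechanism without depending on the real library, here is a stdlib-only sketch of the idea. `SafeHtml` and `_escape` are hypothetical stand-ins, not MarkupSafe's actual implementation, and only `+` is covered.

```python
import html
import re


class SafeHtml(str):
    """Hypothetical stand-in for the idea behind markupsafe.Markup."""

    def __add__(self, other):
        return SafeHtml(str(self) + _escape(other))

    def __radd__(self, other):
        return SafeHtml(_escape(other) + str(self))


def _escape(value) -> str:
    # Already-safe markup passes through; everything else is HTML-escaped.
    if isinstance(value, SafeHtml):
        return str(value)
    return html.escape(str(value))


bold_open = SafeHtml("<b>")  # a literal, so easy to prove safe at the source
page = bold_open + "<script>alert(1)</script>" + SafeHtml("</b>")

# A non-method API such as re.sub knows nothing about SafeHtml and hands
# back a plain str, silently dropping the safety marker.
laundered = re.sub("b", "i", bold_open)
```

Here `page` keeps the `<b>` tags and escapes the script tag, while `laundered` comes back as a plain `str`, which is the pipeline complication described above.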
« Escaping is a function of the consumer, not the producer »
This is incorrect. The producer emits something in a language, be it HTML or JSON or HTTP headers or whatever. Data must be encoded properly for that language. The consumer must then decode, of course, so in a sense it is the job of both. But the onus is really on the producer.
> This is incorrect. The producer emits something in a language, be it HTML or JSON or HTTP headers or whatever. Data must be encoded properly for that language.
Which is the consumption side. When you send data to an HTML template engine, it’s escaped as input, meaning with the template engine as consumer, not with the template engine as producer.
It may be a “pipeline” situation where the consumer also produces something (e.g. JSON or HTML), but it doesn’t have to be: e.g. an SQL interface might have no production, but the data it consumes still needs to be properly escaped.
When your producer produces data, it has no idea how that data will be used, and that’s what determines the necessary transformations: e.g. it’s of no help to you if your templating engine generates content escaped for MSSQL when you’re not going to put it in MSSQL.
> it’s of no help to you if your templating engine generates content escaped for MSSQL when you’re not going to put it in MSSQL.
Allow me to complain a bit about MSSQL.
When you're escaping a LIKE expression for MSSQL, you must also escape the "[" character, since it's a wildcard for MSSQL (and nowhere else except AFAIK Sybase). When you're escaping a LIKE expression for other databases, you must not escape the "[" character, since some databases reject escaping anything other than the % and _ wildcards. That is, your escaping code for a LIKE expression has to be database-specific, because MSSQL (and AFAIK Sybase, it seems both have a common ancestor) decided to be different.
> When you're escaping a LIKE expression for other databases, you must not escape the "[" character, since some databases reject escaping anything other than the % and _ wildcards. That is, your escaping code for a LIKE expression has to be database-specific, because MSSQL (and AFAIK Sybase, it seems both have a common ancestor) decided to be different.
TBF you may need custom codepaths because defaults diverge as well, IIRC postgres and mysql default to ESCAPE '\' while mssql and oracle default to ESCAPE '' (the latter being the actual spec behaviour); sqlite likewise has no escape character unless you add an ESCAPE clause.
So in Postgres and MySQL you must always escape your LIKE parameter, while in mssql, oracle and sqlite that's not the case.
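A sketch of LIKE escaping against SQLite through Python's `sqlite3` module, with an explicit ESCAPE clause so the result does not depend on any engine default. `escape_like` and `find_containing` are hypothetical helpers, the table is made up, and the MSSQL "[" wildcard is deliberately not handled, so this is SQLite-specific as written.

```python
import sqlite3

ESC = "\\"  # a single backslash, named explicitly in the ESCAPE clause below


def escape_like(term: str) -> str:
    """Escape the escape character first, then the % and _ wildcards."""
    return (term.replace(ESC, ESC + ESC)
                .replace("%", ESC + "%")
                .replace("_", ESC + "_"))


def find_containing(conn: sqlite3.Connection, term: str) -> list:
    pattern = "%" + escape_like(term) + "%"
    rows = conn.execute(
        "SELECT name FROM items WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [("a%b",), ("axb",), ("100%",)])
```

Searching for the literal text `a%b` through `find_containing` returns only the row `a%b`; passing the same text unescaped as a pattern would also match `axb`.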
The whole point is that the producer may be hostile, or buggy, and the consumer must handle that. Asserting that it “must” be encoded properly does not make it so.
That doesn't make sense to me and I agree with GP. If I consume HTML and I escape all HTML input I'm given, I'm utterly useless.
Now when I consume text and convert that text into HTML for further treatment, I'm producing HTML, and I must properly escape my input in that conversion. The escaping is only needed because I produce HTML. In fact the only time escaping can be done is when producing data, because if unescaped data is ever produced, the cat's out of the bag.
Edit: Actually, I think that producer/consumer is the wrong way to talk about this. Escaping only ever occurs at a boundary when transforming between formats (eg from "text string" to "html string") which is always both producer (of the new format) and consumer (of the old format). But it can always be thought of as a type cast, with possible type confusions when input and output formats share the same machine representation (eg string).
> That doesn't make sense to me and I agree with GP. If I consume HTML and I escape all HTML input I'm given, I'm utterly useless. [...] Now when I consume text and convert that text into HTML for further treatment, I'm producing HTML, and I must properly escape my input in that conversion.
Which is my point, it's the consumption side which defines what the escaping should be.
> Escaping only ever occurs at a boundary when transforming between formats (eg from "text string" to "html string") which is always both producer (of the new format) and consumer (of the old format).
A database interface is not a transformer / producer, yet it still needs escaping. Globbing is not a transformer either, and it still needs escaping.
The thing that accepts the input must make sure it is properly escaped. Think of SQL injection attacks: they happen because the thing that accepts input hasn't properly escaped the input.
Cross site scripting attacks are exactly the same thing but occur when the input side doesn't properly escape HTML input.
I've always found it more useful to just discard user input that doesn't come in the format you're asking for, and bail on the entire operation.
Like, if the user might be attempting something fishy, there's no reason to try and "clean it up" and have your program "do its best" with the remainder. Throw an error back at the user and move on to the next query.
that sounds awful. you probably reject phone numbers that use spaces instead of dashes or something? if it's correctable, just correct it and don't hassle the user.
if it's ambiguous, then fine, ask the user to clarify.
This is a space where type systems can be extremely helpful.
Escaped input and unescaped input are separate types. And a robust type system will allow you to craft your functions so that the streams cannot be crossed without going through translation layers.
In fact, the most robust type systems will offer things like automatic function composition so that you have to write a minimum of code... If a type coercion function is available, the type system can be taught to just automatically apply that coercion function before dropping the string into the relevant processing.
"Most of the rendering bugs I’ve seen in security audits don’t matter. This is not how your organization will be pwned. ... What would fix this? Layered security built around a plausible threat model. What would not help? Removing reflected ASCII text from Shodan’s API error message. I’m not saying that small security bugs aren’t worth fixing, or that organizational security always trumps application security. Rather, real damage usually does not come from where security engineers tend to expect, because they spend their time on pentests and CTFs that differ substantially from the approaches popular among actual attackers."
Everyone commenting that dealing with user input is easy: if it were really easy, we wouldn't keep making the same mistakes. I fixed my first SQL injection attack by switching some code to bind variables over 20 years ago, yet we still have Little Bobby Tables showing up in our collective databases. The fix may be easy ("just do X"), but the mistake is even easier.
Breadth-first security attacks will exploit input-sanitizing flaws like that. Security audits can certainly help with that, assuming they don't impose a huge security infrastructure and review process that crushes developer productivity, which always seems to happen.
Depth-first attacks as described are a different class of attack, and of course "audit" won't help that much. Education, penetration testing, and honeypots are some of the stuff that works for that.
Ultimately, if an organization treats its work force like crap, then depth-first attacks are unstoppable. The crypto-locker attackers are strangely pro-worker, because they highlight how disgruntled employees are such effective attack vectors via bribery, vengeance, or apathy.
After reading the article I fail to see what is hard about escaping user input.
It seems like what the author means is that it's hard to think of all the places where user input should be escaped, but even then, if you use any modern framework, everything is escaped by default.
Ruby on Rails pretty much handles this. Regular strings are always escaped in views. Only html_safe strings will emit html. For user input, you should always use the sanitize method instead of raw. :)
I use a strongly typed language, repository pattern and an ORM. Good luck trying SQL injections. Also input is sanitized at framework level so good luck with XSS.
Also the input has to bypass validation (for which I have unit tests) and the DTOs are mapped to database models before being written.