Hacker News
by teddyh 1325 days ago
An alternate view: “string” is not a granular enough type, just like “bitfield” is not a type. Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is. But let’s assume that you’ve been a diligent programmer and filtered all that at the edges, and now have a sequence of Unicode code points (or possibly graphemes). You still need to know the escaped-ness of the string! This is also a form of typing. Perl was early with its concept of “tainted” strings, but modern languages can use types to mark this concept in the code. At all points in your code, you should be sure what type the value you have is. If you need to use the types in your language to ensure this, then use types. But make sure of it somehow.
5 comments

> Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is.

This is a language defect. If your language was invented in the 1960s it's an understandable defect, but it's still a defect. I do not want to write computer software with strings in a language that doesn't even have an actual string type rather than "Eh, maybe this is a string or maybe it's just some random bytes, who cares".

Only in very low level software should it make a difference whether the string is in fact represented as UTF-8 or UTF-16 or whatever, but Rust shows that you can write software at a low level and still enforce type safety for strings.

I agree though that here once again the Right Thing™ is a strong type system. If I've got a Microsoft Graph username, a URL, an email address and a UUID, that's four types, those are not four strings with human names to distinguish them. We don't need to escape some or any of these types - in their context.

A type system isn't going to save you from users submitting all kinds of potentially different encodings. Which also depends on what kind of user input is being handled: Is it OS-provided UI? Is it something being sent to a service accessible on the internet? Is it from a CLI? Is it from a file? Context matters for the potential space of what kind of data you might be operating on, which could require different ways of either knowing what kind of data you have based on having more control over the input versus having to detect stuff (or be told, correctly) from highly arbitrary things like reading from a file. All of that is external to the type system, and requires doing something before you can tag it with the correct type. Some languages might attempt to detect this stuff for you, but that could potentially be considered a language defect if it's hard to detect what a string is without having other input telling you what that string contains, such as a header in an HTTP request saying that it's UTF-8.
"A type system isn't going to save you from users submitting all kinds of potentially different encodings."

Yes, it is, because you give that a type that indicates you don't know what the encoding is, like RawInput or something. You then cannot pass this type to any other function that doesn't explicitly call for that type. If you have some function that accepts it, blindly casts it to UTF-8, and slams it out into a file, well, that's not the type system's fault [1].
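
A minimal sketch of the RawInput idea, in Python. The names RawInput, decode_utf8 and save_comment are hypothetical, not from any library, and Python only enforces the annotations through a static checker such as mypy; at runtime the discipline comes from the wrapper having no string methods of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawInput:
    """Bytes of unknown encoding (hypothetical wrapper type)."""
    data: bytes


def decode_utf8(raw: RawInput) -> str:
    """The one sanctioned path from RawInput to str: strict, and able to fail."""
    # errors="strict" raises UnicodeDecodeError instead of guessing.
    return raw.data.decode("utf-8", errors="strict")


def save_comment(text: str) -> str:
    """Downstream code asks for str, so a type checker rejects a RawInput here."""
    return text.strip()


good = RawInput("caf\u00e9 ".encode("utf-8"))
bad = RawInput(b"\xff\xfe not utf-8")

saved = save_comment(decode_utf8(good))

try:
    decode_utf8(bad)
    rejected = False
except UnicodeDecodeError:
    # The failure surfaces at the edge, where the bytes entered the program.
    rejected = True
```

Calling save_comment(bad) directly is the mistake the types are meant to catch: the checker flags it before the bytes get anywhere near a file.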

Of course a type system won't prevent you from still just being wrong or writing bugs; nobody promises that, not even the formal methods advocates. But it will prevent you from just accidentally blindly shoveling it out somewhere it doesn't belong without ever examining it or thinking about it.

I think you may believe in a popular myth about strong typing systems, that they are designed to somehow prevent bad data from coming in to your system at all. You correctly identify that as impossible. But what strong typing systems can do is force you to deal with the fact that bad data may be coming in. On the outside, you have the chaos of, say, a bag of bytes that may or may not be JSON. On the inside, you have a "type SomeStruct { int a; int b }". A strong type system forces you to write some sort of adapting code between those two, and guarantees that the result of that adapting code will be only and exactly the type that comes out of that adapting code, no "whoops, sometimes this dynamic code just returns a string, or maybe a network socket, or who knows what". Nothing can prevent your HTTP API from receiving a JPG of an anime character instead of JSON specifying a user to delete, but a strong type system can make you deal with that immediately and fully, instead of garbage data of indeterminate type floating through the system for an indeterminate period of time.
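
The adapting code between "bag of bytes" and SomeStruct could look roughly like this in Python. The function name parse_some_struct and the BadInput exception are made up for the sketch; the point is only that the function either returns a SomeStruct or raises, and nothing in between.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SomeStruct:
    a: int
    b: int


class BadInput(ValueError):
    """Raised when the outside world sends something that is not a SomeStruct."""


def parse_some_struct(payload: bytes) -> SomeStruct:
    """Turn untrusted bytes into a SomeStruct, or fail with BadInput."""
    try:
        obj = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise BadInput("not UTF-8 encoded JSON") from exc
    if not isinstance(obj, dict) or set(obj) != {"a", "b"}:
        raise BadInput("expected an object with exactly the keys a and b")
    a, b = obj["a"], obj["b"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, int) for v in (a, b)):
        raise BadInput("a and b must be integers")
    return SomeStruct(a, b)


parsed = parse_some_struct(b'{"a": 1, "b": 2}')
```

A JPEG, a bare JSON string, or an object with the wrong fields all end up on the BadInput path immediately, rather than drifting further into the program.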

[1]: Also note there are a lot of "strong type systems" in the world that still fail to take advantage of their own capabilities and let bare string types and such float around too much. There are reasons why libraries must support the lowest common denominator; a file is a series of bytes with no further constraints, so the lowest level API has no choice but to accept that, but higher level APIs should more often take more restricted types. That strong type systems can save you from this doesn't mean they all do. I have a number of wrapper types in various languages just to add these guarantees to my programs not provided by the underlying libraries, though I also have some code that just wraps the underlying libraries that can't help but correctly take raw bytes at the lowest level.

>If you have some function that accepts it, blindly casts it to UTF-8

Unfortunately, if you interact with services you didn't write, you're usually back to getting "strings" of unknown encoding, and typically requirements that force some blind or semi-blind guessing.

Blind guessing is not related to the type system. Nobody has claimed type systems can solve that. What they can do is force you to guess, and make it clear where that is occurring.

This, again, goes back to a very broken understanding of type systems that I often see, and once held myself. The claim of type systems is not that they magically go out into the world and fix the external world to be well-typed; the claim is that they force your code to deal with the conversion of the external world into a clean internal representation, and presumably, to have a clean error pathway when that fails. Dynamically-typed code will let you float along much more easily. Statically-typed code can still be written that way, but at least then it's poor statically-typed code. In some circles that sort of broken dynamic code is essentially idiomatic. (Though that is fading away as every year more programmers learn how bad an idea that is.)

I agree with that if you qualify it with "sometimes". Strong types can force you to guess, sometimes. Other times, the data fits the type but isn't the type.
If the language explicitly says how strings are defined, libraries that go "Eh, I'll just shove nonsense bytes in this data structure and claim that's a string" are broken by definition.

That's just as true in Java as in Rust. The problem is languages like C++ or D which just don't care and have a "string" type that might just be some bytes.

I don't mean libraries, I mean external services. Ambiguous strings are everywhere.
The way Rust does it is IMO interesting. There is e.g. an OsStr type for strings that describe, say, filenames in a directory listing, because these could actually be invalid UTF-8 but your program might still need to be able to handle them.

So when you wanna convert that OsStr to a String you are forced to handle this in one way or another. This is less comfortable, but describes the underlying systems more accurately.

There is no such thing as an "escaped string". Escaping is not a general concept, it is something that differs given the intended destination of that string. For example, "I am a %SYSTEM% person" is a perfectly fine escaped bash string, but an unsafe CMD string; it is also fine as a C# format string, but potentially unsafe as a C format string, depending on your actual implementation of printf; it is an escaped MSSQL filter string, but not an escaped PostgreSQL filter string.
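
To make the destination-dependence concrete, here is one string pushed through three escapers from the Python standard library. The sample text is made up for illustration; each result is correct for its own destination and says nothing about the other two.

```python
import html
import json
import shlex

text = 'Tom & "Jerry" <3 $HOME'

as_html = html.escape(text)    # Tom &amp; &quot;Jerry&quot; &lt;3 $HOME
as_json = json.dumps(text)     # "Tom & \"Jerry\" <3 $HOME"
as_shell = shlex.quote(text)   # 'Tom & "Jerry" <3 $HOME'

# The HTML-escaped form still contains an unquoted $HOME, so it is
# not safe to splice into a shell command line.
html_form_still_has_shell_metachar = "$HOME" in as_html
```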

Also, not all strings/texts should be thought of as Unicode code points/graphemes.

Sure, you need a different type for each different form of escaping you want to track, but that doesn't make the idea unworkable.

A type that says (say) "this is a string containing html PCDATA" is a useful thing to have.
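
A rough sketch of such a type in Python, with the hypothetical name HtmlPcdata. It assumes the only ways to obtain a value are from_text (which escapes) and concatenation with another HtmlPcdata; mixing in a plain str is refused.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class HtmlPcdata:
    """Text already escaped for HTML character data (hypothetical type)."""
    markup: str

    @classmethod
    def from_text(cls, text: str) -> "HtmlPcdata":
        # quote=False: inside element content only &, < and > need escaping.
        return cls(html.escape(text, quote=False))

    def __add__(self, other: "HtmlPcdata") -> "HtmlPcdata":
        if not isinstance(other, HtmlPcdata):
            return NotImplemented  # HtmlPcdata + str ends in a TypeError
        return HtmlPcdata(self.markup + other.markup)


title = HtmlPcdata.from_text("Fish & <Chips>")
combined = title + HtmlPcdata.from_text("!")

try:
    title + "<b>raw</b>"  # a plain str has no provenance, so this is refused
    mixed_ok = True
except TypeError:
    mixed_ok = False
```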

The biggest problem appears once you start wanting to combine such strings.

Say a user inputs some raw text in a form that is intended to be the title of a button in an HTML form that will be sent in a JSON file to be stored in a SQL db. The expectation is that you can later retrieve this HTML snippet from the DB and display it on the screen.

You have to first escape the raw text from the user so that it can be safely used in HTML - so you will go from user_input_string to html_pcdata_escaped_user_input_string. Then you compose a bit of HTML that contains the button and this part; let's say you store it in some HTML DOM object. Then you want to send this HTML object as JSON, so you have to know to convert html_pcdata_escaped_user_input_string into json_string_escaped_user_input_string - but that loses type information which may hurt us later, so maybe we want to actually store it as json_string_escaped_html_pcdata_escaped_user_input_string. Then, if we want to use this as part of an SQL query string, by the same considerations, we want to put it in a mysql_like_filter_escaped_json_string_escaped_html_pcdata_escaped_user_input_string - which is getting really ugly, and easy to mess up.

Of course, the order of escaping matters, so a mysql_like_filter_escaped_json_string_escaped_html_pcdata_escaped_user_input_string and a json_string_escaped_mysql_like_filter_escaped_html_pcdata_escaped_user_input_string are different things that need to be decoded differently (of course, for SQL in particular we could use prepared queries instead).

Also, we can't ever concatenate this with any other string-like type until perhaps the final use point (such as sending a query string to the DB), since we need to remember which part of the string is escaped in which way, and for what types of uses it is safe (an HTML-escaped string may still contain SQLi or JSON injection).

The point is that even with proper types, this is not easy to manage or fix.

It also requires quite advanced type systems to be able to use these in normal contexts - say, you want to store several such strings with different provenances in a Map or Set or even List, without "forgetting" the provenance.

As far as I'm concerned, "the point is" that what we're describing here -- keeping track of the type of each piece of data -- is the best way to think about this class of problems (as opposed to talking about "sanitising" or "safe data", for example).

It's then up to us to decide how to best make use of the type system of whatever language we end up implementing it in (or, indeed, to treat the ability to deal with this well as a requirement when we're choosing a language).

For me, effects like "we can't ever concatenate this with any other string-like type" are desirable features, not problems with this approach: either it's possible to convert both strings to a common form, or I shouldn't be trying to combine them.

Sure - I'm just saying that the article is right that this problem is difficult, not easy, and that it doesn't get significantly easier if we accurately keep track.
> The point is that even with proper types, this is not easy to manage or fix.

In practice, in a typed language, nothing like this ever occurs, because the rule is just: "use string for everything, except the edge".

You're thinking of a type like: HtmlString<JsonString<Utf8String>>

In practice the type that is "passed around" is almost always just "string", and this is converted at the last moment to a single destination format, such as HtmlString.

When writing to databases, there isn't even an escape step at all, because you use parametrised queries, right? Right!?

The database stores "string", not "DatabaseEscapedString".
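
For anyone who hasn't seen it, this is roughly what the parametrised version looks like with Python's built-in sqlite3 module. The table name and the hostile title are made up for the sketch; the value travels separately from the SQL text and comes back byte-for-byte as it went in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buttons (title TEXT)")

title = "Robert'); DROP TABLE buttons;--"

# The ? placeholder is filled in by the driver; no escaping step in our code.
conn.execute("INSERT INTO buttons (title) VALUES (?)", (title,))

stored = conn.execute("SELECT title FROM buttons").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM buttons").fetchone()[0]
```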

This is similar to how instants in time ought to be handled. You store them as UTC and convert to the user's time zone at the last moment. You don't pass around some monstrosity that somehow keeps track of +10-5+2 in order to arrive at +7. That would be absurd. Instead you pass around the "Z" UTC timestamp and add +7 when needed.
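
The timestamp analogy in code, as a small Python sketch with an arbitrary example instant and a fixed +7 offset standing in for the user's zone:

```python
from datetime import datetime, timedelta, timezone

# What gets stored and passed around: one unambiguous UTC instant.
stored = datetime(2021, 3, 14, 9, 30, tzinfo=timezone.utc)

# What happens at the edge, right before display.
user_zone = timezone(timedelta(hours=7))
shown = stored.astimezone(user_zone)

print(stored.isoformat())  # 2021-03-14T09:30:00+00:00
print(shown.isoformat())   # 2021-03-14T16:30:00+07:00
```

Both values denote the same instant; only the presentation changed.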

> In practice the type that is "passed around" is almost always just "string"

That's what happens in practice, of course. The GP was proposing something else, and I was explaining how complicated that gets.

> and this is converted at the last moment to a single destination format, such as HtmlString.

I explained before why this doesn't work unless we're talking about the final destination of this string. Otherwise, if that string is being taken through various encodings (say user input to JSON to sprintf format string to HTTP body), and if you need to combine safe and unsafe input, then what you're saying doesn't work anymore.

Here is a sketch of an example:

  userInput := read()
  jsonFormat := "{\"context\": \"%s\", \"input\": \"" + json.escape(userInput) + "\"}" // easy and safe
  finalJson := ""
  sprintf(finalJson, jsonFormat, "some context")
  // oops - unsafe if original input was "%s"
  sprintf(finalJson, printf.escape(jsonFormat), "some context")
  // oops - does the wrong thing - it will output "{\"context\": \"%s\", \"input\": \"%s\"}"

  // let's try the other way around?

  userInput := read()
  formatStr := "{\"context\": \"%s\", \"input\": \"" + printf.escape(userInput) + "\"}" // easy and safe
  finalJson := ""
  sprintf(finalJson, formatStr, "some context")
  // oops - unsafe if userInput was "safe-looking\", \"bypassAuth\": \"true\"}"

  sprintf(finalJson, json.escape(formatStr), "some context")
  // oops - does the wrong thing - it will output
  // "\"{\\\"context\\\":\\\"some context\\\", \\\"input\\\": \\\"safe-looking\\\", \\\"bypassAuth\\\": \\\"true\\\"}\"" - that is, a JSON string instead of a JSON object

The only solution to get this to work is to keep the user input string entirely separate from any other string, and apply escaping to it individually at every level where it is used.

Additionally, you will need to remember what escaping has been applied to it, and in what order, so that it can be un-escaped back to the original value when needed.
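
A runnable Python rendering of that point, using the % operator as a stand-in for sprintf. The helper printf_escape is hypothetical, and the layering shown (JSON-escape first, then format-escape, applied to the user input alone) is just one way to read "individually at every level".

```python
import json


def printf_escape(s: str) -> str:
    """Hypothetical helper: make s literal inside a %-format string."""
    return s.replace("%", "%%")


user_input = '%s", "bypassAuth": "true'

# JSON-escaped only, then spliced into a format string: the format layer
# still sees the "%s" that came from the user.
naive_fmt = '{"context": "%s", "input": ' + json.dumps(user_input) + "}"
try:
    naive_fmt % ("some context",)
    naive_failed = False
except TypeError:
    naive_failed = True  # two %s placeholders, one argument

# Escape the user input once per layer it will pass through, innermost first.
layered_fmt = (
    '{"context": "%s", "input": ' + printf_escape(json.dumps(user_input)) + "}"
)
final_json = layered_fmt % ("some context",)
decoded = json.loads(final_json)
```

Note that "some context" is a constant here; if it were user-controlled too, it would need its own JSON escaping before being passed as an argument.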

You're basically running around with your finger on the trigger and suggesting that everyone everywhere ought to wear ballistic armour to compensate.

This is how you put the safety on and return the gun into its holster:

    using System;
    using System.Text.Json;
    
    string maliciousInput = "{0} % $0 -- DROP TABLE \"USERS\"";
    
    // Always, always, ALWAYS use a proper serializer for assembling formats like JSON.
    // The malicious input can include actual JavaScript, and it'll be correctly encoded with 100% safety.
    string encoded = JsonSerializer.Serialize( new {
        context= "{0}",      // .NET format string placeholder
        input=maliciousInput
    });
    
    // This will just work, formatting placeholders are ignored if no parameters are specified
    Console.WriteLine(encoded);
    
    // A safe FormatException is thrown if you mis-use the string formatting code.
    // No vulnerability other than DoS.
    Console.WriteLine(encoded, "adfasfd");

Test here: https://dotnetfiddle.net/p8P1fO

Fundamentally, putting any format like JSON or any user-controlled input into the first parameter of sprintf or any similar function in any language is Wrong with a capital W. It ought to be picked up in code review.

Ideally, sprintf-like functions in strongly typed languages should use a special "FormatString" type instead of a plain string as the first input. This would automatically fix any such issues, but relying on this is still problematic. Naively printing potentially malicious input to places like the console is still quite dangerous, no matter how much you escape it! Logs can be captured into systems that then paste it directly into HTML. Similarly, console control codes can be used by attackers as a nuisance. Etc... Structured logging, along the lines of OpenTelemetry is safer.

See: https://owasp.org/www-community/attacks/Log_Injection
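
A small Python illustration of the log injection problem. It uses plain JSON lines as a stand-in for structured logging (it is not OpenTelemetry), and the field names are invented for the sketch.

```python
import json

# Attacker-supplied "username" that smuggles in a fake second log entry.
user_input = "alice\n2024-01-01 INFO admin logged in"

# Naive text logging: the newline survives and forges a log line.
naive_line = "INFO login failed for " + user_input

# Structured logging: the value is one field of one record, newline escaped.
structured_line = json.dumps(
    {"level": "INFO", "event": "login_failed", "user": user_input}
)

naive_lines = len(naive_line.splitlines())            # 2
structured_lines = len(structured_line.splitlines())  # 1
```

Whatever later reads the structured record still has to escape the user field for its own destination, for example HTML.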

This is the safe equivalent of your second example. Both format strings and JSON are correctly handled:

    Console.WriteLine( "{0}", JsonSerializer.Serialize( new {
        context="{0}", // if sprintf/WriteLine is not misused deliberately, this is safe!
        input="safe-looking\", \"bypassAuth\": \"true\"}"
    }));

This outputs:

    {"context":"{0}","input":"safe-looking\u0022, \u0022bypassAuth\u0022: \u0022true\u0022}"}

Link: https://dotnetfiddle.net/Lm8jkR

(Just for the record, I agree completely, and nothing of what I wrote should be construed as contradicting any of that.)

Escape strings at the last possible moment, and ideally it's done by whatever library you're using so you never have to worry about it. In our codebases it has always been clear to me whether I'm dealing with a raw string or a safe one: they're all unsafe, because you have no clue what context they're going to be used in.

If you're writing a web framework or a DB library things might be different though - in that case a different class probably makes sense. If you have a module for a certain communication medium, then yeah you might use it in that module. But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.

Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.

> They're all unsafe, because you have no clue what context they're going to be used in.

That's correct, but it's the reverse thinking from the escaping one.

Because in the escaping one, when you need not to escape you will also not-escape at the last possible moment, and that's a sure-fire way to launder attacker-controlled data.

Instead you should escape everything, and opt-out as early as possible.

> But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.

That's why you do the reverse: most strings are unsafe to everything, but the strings which are safe are generally safe to one specific subsystem. So you say that.

> Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.

It sounds like hell because it makes no sense, there's no such thing as a TCPString because TCP is not string-based and TCP messages are not composed that way.

> Instead you should escape everything, and opt-out as early as possible.

That’s not even remotely workable for any system with more than one kind of “escaping”. What if I want to use a string as:

1. An IDNA-encoded domain name

2. An HTML text snippet

3. A shell command string argument

4. A string literal part of a regular expression

5. A part to be used in an XML CDATA section

6. A JSON string

I can’t escape the string beforehand, since the escaping rules are all different. No, the only sensible alternative is to use the same rule which we all use for character encoding: Encode and decode (and escape) at the edges.
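
A sketch of "escape at the edges" for three of the destinations in that list, in Python. The function names are made up; the program keeps one plain str and each sink applies its own rule on the way out. The CDATA handling uses the usual trick of splitting "]]>" across two sections.

```python
import re
import xml.etree.ElementTree as ET


def to_idna(domain: str) -> bytes:
    """Edge for DNS: the standard library's IDNA codec."""
    return domain.encode("idna")


def to_regex_literal(text: str) -> str:
    """Edge for regular expressions: match the text literally."""
    return re.escape(text)


def to_cdata(text: str) -> str:
    """Edge for XML CDATA: "]]>" would end the section early, so split it."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


domain = to_idna("b\u00fccher.example")   # b'xn--bcher-kva.example'
pattern = to_regex_literal("1+1=2?")
section = to_cdata("a]]>b")

# Each escaped form round-trips through its own consumer.
regex_ok = re.fullmatch(pattern, "1+1=2?") is not None
cdata_text = ET.fromstring("<r>" + section + "</r>").text
```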

> I can’t escape the string beforehand, since the escaping rules are all different.

You’re still misunderstanding. You shouldn’t escape at any point, instead you should mark things as safe as early as possible.

“Safe” almost always has a single context, you don’t care if it’s going to go somewhere else because it’s not safe for there.

Anything that’s not marked as safe is then automatically considered unsafe and processed as such by the sink.

> What if I want to use a string as:

It’s not an issue, because by default nothing is safe anywhere, so all those APIs should treat the injected data thus.

There is no escaping, because everything is automatically internally escaped by default.
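
What "unsafe by default, marked safe explicitly" can look like, as a toy Python sketch. SafeHtml and render are hypothetical names for illustration, not the API of any template engine; the sink escapes every value unless the caller has vouched for it.

```python
import html


class SafeHtml(str):
    """Marker type: the author vouches that this is ready-made HTML."""


def render(template: str, **values: str) -> str:
    """HTML sink: anything not marked SafeHtml is escaped on the way in."""
    prepared = {
        name: value if isinstance(value, SafeHtml) else html.escape(value)
        for name, value in values.items()
    }
    return template.format(**prepared)


page = render(
    "<p>{greeting} {name}</p>",
    greeting=SafeHtml("<b>Hello</b>"),
    name="<script>alert(1)</script>",
)
# <p><b>Hello</b> &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The marker only means "safe for this HTML sink"; a shell or SQL sink would ignore it and treat the value as untrusted.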

> It’s not an issue, because by default nothing is safe anywhere, so all those APIs should treat the injected data thus.

No library does this, since it does not know what strings I send it with their literal meaning intended, and which strings I send it with their escape characters intended to be interpreted. The escape characters are part of the API of that library. The library does not accept “strings” as such, it accepts “escaped” strings. And since my program deals with normal unescaped strings, I have to escape the strings before I send them to the API.

> There is no escaping, because everything is automatically internally escaped by default.

I have a feeling that you have a different meaning of the word “escaped” than me.

> No library does this

Most modern templates do exactly that. Jinja certainly does.

> The library does not accept “strings” as such, it accepts “escaped” strings. And since my program deals with normal unescaped strings, I have to escape the strings before I send them to the API.

That’s the problem with the library. That is what needs to be fixed.

> I have a feeling that you have a different meaning of the word “escaped” than me.

Add “explicit” to the first occurrence if you don’t understand without it.

> a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4)

Agreed. My future perfect programming language has the predefined types 'ascii', 'utf-8', 'url', 'base64', etc. for misc kinds of character sequences.

Just like how raw bits are different from numerals: short vs byte, word vs int, 64-bits vs double, etc.

(Anyone have a better naming system for 8, 16, 32, and 64 bit chunks of raw data? 'byte', 'word', 'doubleword', 'quadword'?)

Per this "ridonkulously hard" OC article, I'll also ponder predefined types for raw 'html5', 'json', etc (as in unparsed, char sequence vs DOM).

--

> Perl was early with its concept of “tainted” strings.

Not being a Perl dev, I'm unfamiliar with "taint". Quickly found articles like this: https://www.geeksforgeeks.org/perl-taint-method/

In my future perfect language, char seqs cannot be cast. They must be converted. Basically syntactic sugar for Java-style char encoding infrastructure.

I have assumed that disallowing casting was sufficient. But now I'll have to ponder "taint" too. From the hip, I really like the notion of tracking the provenance of data, a la defensive programming.

Great idea. Thanks.

No. The way to solve this is to recognize where the problem lies. The problem does not lie with storing user input. The problem lies with improperly putting strings in other data.

So all you need to do is to do that properly. Either you commit to using constructs like parametrized queries instead of concatenating strings and use the DOM to put together HTML the way you want, or you escape as you concatenate the strings.
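
For the DOM half of that, a minimal Python sketch using xml.etree.ElementTree as a stand-in for a real HTML DOM. The element and the hostile title are invented for the example; the raw string is handed to the tree as data and the serializer does the escaping.

```python
import xml.etree.ElementTree as ET

title = '"><script>alert(1)</script>'

# Build structure, never markup text: the raw string is just a value.
button = ET.Element("button", {"type": "submit", "title": title})
button.text = title

markup = ET.tostring(button, encoding="unicode")

# Parsing the result gives back the original, unescaped value.
parsed = ET.fromstring(markup)
```

What gets stored or passed around is still the plain title; the escaped form exists only in the serialized output.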

Don’t store escaped strings, it’s a recipe for disaster.