Hacker News new | ask | show | jobs
by betterunix 4688 days ago
"this is done by escaping inputs"

Something about that sounds wrong. It is as though you are suggesting that we use in-band signaling with a bunch of notch filters to ensure that Cap'n Crunch whistles cannot be used to get free calls. The right answer is out-of-band signaling -- in other words, not constructing queries / command strings / etc. from user inputs.

Major SQL databases all support prepared statements; this seems like a far more robust approach than trying to sanitize inputs.

1 comments

I think that comment is a succinct summary of why SQL injections are still common, anyway. The answer: because people think "escaping" is the answer.

Imagine constructing a function in some other language this way:

    function square(x) {
        code = x + "*" + x;
        return eval(code);
    }
This is obviously wrong. Now, let's say I suggest fixing it with escaping:

    function square(x) {
        code = escape(x) + "*" + escape(x);
        return eval(code);
    }
Any sane programming community would string me up by my thumbs for doing this, and rightly so.

Yet, do the exact same thing in SQL and it's just business as usual. Occasionally people talk about prepared statements and parameterized queries and such, but usually people just talk about escaping. As if the problem is just that you need more backslashes, not that you're glomming strings together at runtime and then calling eval() on them.

It's so bizarre. People would think you're crazy for doing it in one environment, but doing it in a different environment where the exact same problems are present is just fine! I can't fathom how we got to this place.

I should have used properly "construct input" (as suggested by dragon-writer) rather than "escape". If possible, you'd use query parameters or some other reusable and verified mechanism to encode user content. If you're embedding a data stream of one type within another, you still have to manage boundaries to encode content properly.

I'd note for database languages, query construction is often two phase, where you build your query template based on selection/filtering needs, and then, use parameter substitution with user-provided content. I've seen experienced programmers do both at the same time and resort to manual escaping of user content rather than seeing them as distinct phases.

I think it's about managing the movement of information between contexts... one process' data is another process' code. The most important context switch happening between untrusted inputs and the rest of your system.

I apologize for deleting my other comment; I was trying to shorten it.

I'll move my reply here, then.

You make good points, and I'm trying to figure out exactly how your points and mine hook together.

I think what it ultimately comes down to is distinguishing between data and code. More specifically, data can contain code, but code should never contain (foreign) data. So escaping in general is fine, but escaping in this context, where you're building a string of code, is no good.

That, I think, is why I'm comfortable with e.g. the escaping done to a string when emitting JSON, but not with escaping a string to put it into an SQL query.

With parameterized queries, you're still ultimately passing the code and data over to the database over some sort of stream (assuming an out-of-process database server), but hopefully that stream is designed as data containing both the parameterized SQL code and the parameters, rather than just SQL code with escaped parameters.

Doesn't the database server just escape the data in the parametrized query? So if you used the database's escape functions directly wouldn't it be the same thing? I guess if you are arguing that developers are more likely to mess it up than the database server then I see your point.
Not usually. What usually happens is something like this:

    SELECT * FROM bar_table WHERE x = 123;
Is converted into a datastructure like this in the database engine (really simplified):

    {
      "querytype": "select",
      "fields": ["bar_table.x", "bar_table.y", "bar_table.z"],
      "tables": ["bar_table"],
      "where_filters": ["x", "=", 123]
    }
Which is then converted into a list of things to do to run that query.

Now, if you give it a prepared statement:

    SELECT * from bar_table WHERE x = ?;
The database engine can convert that into a closure, something like this:

    function(arg1){
      return {
        "querytype": "select",
        "fields": ["bar_table.x", "bar_table.y", "bar_table.z"],
        "tables": ["bar_table"],
        "where_filters": ["x", "=", arg1]
      };
    }
Then when you run the prepared query, it can simply call the closure, putting the right value in the datastructure without escaping anything, and continuing on from there.

Of course, the realities of it are more complicated than that for optimization reasons, but that's the basis of it.

Makes sense. Thanks for the explanation.
I sure hope not. If it does, then I'm going to be very sad, and my opinion of humanity will be ever so slightly lowered.
> Occasionally people talk about prepared statements and parameterized queries and such, but usually people just talk about escaping.

Occasionally? For many years "preferred prepared statements and use user input, sanitized (via escaping and/or more involved means) or not, only with a very special need that where prepared statements don't do what you need, and then be as restrictive as possible in what you accept" has been what I've heard everywhere.

Lucky you! I've not seen it be so common. Obviously the smart guys who properly understand what's going on are all about parameterized queries, but I still see a ton of other people talking about escaping.

To double-check and make sure I'm not just being biased, I did a Google search for "php mysql tutorial" and read through the top five results. Not a single one mentioned parameterized queries. One of the tutorials didn't mention using foreign data in queries at all, just hard-coded query strings. Of the other four, two escaped parameters and two just put everything in single quotes and called it a day.

The PHP community has an unusually reckless disregard for proper programming practices. Poisonously bad tutorial sites like w3schools are not helping either.

Java, Perl, Ruby, Python, C#, NodeJS and virtually other language or framework strongly encourages the use of placeholders. PHP's community stands alone in stubborn opposition to this despite having facilities to do this.

PHP is also unusually hostile to frameworks and ORM-like database layers even when these would solve a myriad of problems without imposing too greatly.

Honestly, on the whole PHP programmers in general are like hunter gatherers who see no use for anything other than the most basic of tools. There are exceptions, but I figure these people usually graduate to other communities where the best practices they follow have already been fully embraced.

I completely agree. However, the PHP community is still a huge part of the server-side web community, so they're largely representative even if nobody else follows their lead.
Representative of what? No other community follows PHP. If anything, PHP is trying to imitate other languages with varying degrees of success.
> I did a Google search for "php mysql tutorial"

I have no problem acknowledging that the dominant advice in the domain of PHP MySQL tutorials has been to use escaping, and that that's a serious problem.

Clearly, we were thinking about different scopes.

Unfortunately, the scope for many web applications, even big and popular ones, seems closer to mine than yours.
>Any sane programming community would string me up by my thumbs for doing this, and rightly so.

How do you explain shell scripting then? Bash and co. have been 'industry best practice' for decades. They feature all these problems of in-band communication and (attempts at) escaping and nobody seems to consider it a problem.

Shell scripting is another insane domain where stuff that would be crazy everywhere else is suddenly accepted. Fortunately, they at least recognize that it's usually a bad idea to feed potentially hostile data to local scripts.
Thank you for stating what seems like it ought to be an obvious truth, with such a stark example :)