Hacker News new | ask | show | jobs
by clarkevans 4687 days ago
I should have used properly "construct input" (as suggested by dragon-writer) rather than "escape". If possible, you'd use query parameters or some other reusable and verified mechanism to encode user content. If you're embedding a data stream of one type within another, you still have to manage boundaries to encode content properly.

I'd note for database languages, query construction is often two phase, where you build your query template based on selection/filtering needs, and then, use parameter substitution with user-provided content. I've seen experienced programmers do both at the same time and resort to manual escaping of user content rather than seeing them as distinct phases.

I think it's about managing the movement of information between contexts... one process' data is another process' code. The most important context switch happening between untrusted inputs and the rest of your system.

I apologize for deleting my other comment; I was trying to shorten it.

1 comments

I'll move my reply here, then.

You make good points, and I'm trying to figure out exactly how your points and mine hook together.

I think what it ultimately comes down to is distinguishing between data and code. More specifically, data can contain code, but code should never contain (foreign) data. So escaping in general is fine, but escaping in this context, where you're building a string of code, is no good.

That, I think, is why I'm comfortable with e.g. the escaping done to a string when emitting JSON, but not with escaping a string to put it into an SQL query.

With parameterized queries, you're still ultimately passing the code and data over to the database over some sort of stream (assuming an out-of-process database server), but hopefully that stream is designed as data containing both the parameterized SQL code and the parameters, rather than just SQL code with escaped parameters.

Doesn't the database server just escape the data in the parametrized query? So if you used the database's escape functions directly wouldn't it be the same thing? I guess if you are arguing that developers are more likely to mess it up than the database server then I see your point.
Not usually. What usually happens is something like this:

    SELECT * FROM bar_table WHERE x = 123;
Is converted into a datastructure like this in the database engine (really simplified):

    {
      "querytype": "select",
      "fields": ["bar_table.x", "bar_table.y", "bar_table.z"],
      "tables": ["bar_table"],
      "where_filters": ["x", "=", 123]
    }
Which is then converted into a list of things to do to run that query.

Now, if you give it a prepared statement:

    SELECT * from bar_table WHERE x = ?;
The database engine can convert that into a closure, something like this:

    function(arg1){
      return {
        "querytype": "select",
        "fields": ["bar_table.x", "bar_table.y", "bar_table.z"],
        "tables": ["bar_table"],
        "where_filters": ["x", "=", arg1]
      };
    }
Then when you run the prepared query, it can simply call the closure, putting the right value in the datastructure without escaping anything, and continuing on from there.

Of course, the realities of it are more complicated than that for optimization reasons, but that's the basis of it.

Makes sense. Thanks for the explanation.
I sure hope not. If it does, then I'm going to be very sad, and my opinion of humanity will be ever so slightly lowered.