Hacker News new | ask | show | jobs
by k4st 4797 days ago
The SQL generated appears correct, unless the underlying database can guarantee that all foreign key constraints are met. That is, I consider this a failure of the user of the ORM to appreciate that the two invocations of filter are not identical.

In the former case, the query is verifying that the object_id field cannot be used to find a foreign object--regardless of the value of object_id. This is exactly what it is asked to do.

In the latter case, the query is simply verifying that object_id is NULL, which is exactly what it's asked to do.

3 comments

"other_id" and "other" here refer to two different ways of referring to a many-to-one reference between "somemodel" and "othermodel". Asking for rows of "somemodel" where "object_id" is None is the exact same thing as asking for rows from "somemodel" where its reference to "object" is None. Both should produce the same query, the simple one without the JOIN. The query with the JOIN is completely wasteful and not at all correct - it does a LEFT OUTER JOIN to the remote table, only to filter on those rows where the remote table has no match; but this is already obvious from whether or not "somemodel.other_id" is NULL.
You're right in that they are largely the same action, but consider this case:

An Owner is deleted without setting the pets.owner_id to NULL. So now there's a row in Pets with an owner_id referring to an Owner that doesn't exist.

These two queries will (rightfully) return different things in this case.

One will return The Pets who have null for an owner_id. The other will return the Pets whose owner_id is NULL AND those who do have an owner_id, but it doesn't reference an existing row in Owner.

OK, you're right in that owner_id can refer to a nonexistent entry, if the schema both doesn't make correct use of constraints and is also referentially corrupt. Designing the ORM to jump through huge hoops to suit the case that the user is using the relational database incorrectly, in such a way that enormous performance overhead is added to the use case as used correctly in the vast majority of cases, is IMO a poor design decision.
Great point, I didn't even think of that! The join there is for a reason, and that you're getting the entire 'model' not a field. If someone thinks what ORM does in this case is stupid, they probably think table constraints are stupid.
So the ORM is too stupid to know that in the case the other_id is NULL, it doesn't need to create a join subquery. But would you rather the PostgreSQL query planner figure this out or do it yourself in python? I would like to see the time difference both queries take, and I don't think I would want Django ORM to do this.
I think you should run EXPLAIN ANALYZE on both queries and see what you find.
I now understand that it won't be optimized away because it's doing something, and I think I still want it to do that something. I've run into plenty of ORM limitations and use custom SQL for procedures and custom joints, but I still wouldn't call Django ORM stupid for doing what it's doing, in this case.
Correct, these are semantically different. You’ve described it well. The ORM would be incorrect if it did not generate the SQL that it did.

Part of the issue is confusing object data with metadata. The id is an implementation detail of the data store – metadata. What the user is trying to do here is “talk database” and “talk model” at the same time.

Which is a perfectly good argument against an ORM. But one should commit to a metaphor, or not: http://clipperhouse.com/2012/02/29/suspension-of-disbelief/

If the schema is known to have a FOREIGN KEY constraint on sometable.other_id, then the SQL is as close as can be to wrong without actually returning the wrong answer. In the presence of the FK, the two queries have the same intent. But the first produces a vastly complex query plan for no reason, and this has nothing at all to do with the application/model layer.

Edit: plus! even if you want the same answer assuming that FKs are not in place and that bogus values might be present in other_id, the query is still far less efficient than it should be. You should be doing this:

    select * from sometable where not exists 
    (select 1 from othertable where othertable.id=sometable.other_id)
compare the query plans on any reasonable database and see (and yes, SQLAlchemy produces the NOT EXISTS form when the relationship is a one-to-many versus many-to-one and you ask it for objects with empty collections).
“Wrong but works correctly” is a pretty novel definition of wrong.

The argument you are making is that this particular ORM needs optimizations. Which is true! The point stands that the semantics of the two expressions is different, and thus should do different things. The point also stands that the user is mixing metaphors.

If the ORM can be informed of the guarantee that the FK constraint provides, and optimize accordingly, that’s good too. But this doesn’t tell us much about ORMs, except that they can be improved.

Also, a database that has two different query plans for queries that are logically equivalent…needs improvement.

Yes, the first querry is correct, and the second one will return incorrect results if you don't keep the consistency of your database in check.

No, the Django ORM is not correct in unsing the correct query by default. At least not on every database backend. That's because on any database that enforeces constraints, it already created the necessary checks, and altough functionaly they are equivalent, that line can mark the difference between a 5ms or a 45 minutes runtime (that's what happens with my data, not a hypotetical case). There is more to a framework than mathematical correctness.

But then, it should certainly have an option to always use the correct query, and it is a lot of implementation work that I can understand quite well that Django developers don't want to do now. They probably have other priorities.

And by the way, as I said, this is a problem to me (but no, not important enough that makes me fix it now, maybe later), but did I stop using an ORM just because of an abstraction leakage? Of course not, I encapsulated a solution to this problem and gone on, earning lots of man-hours at the 99% of my code where Django's ORM doesn't leak. That's my main disagreement with the article. Yes, ORMs are stupid, but not using one just because of that is stupid.

And yes, if you let Django templates go, you can easily cut 9ms of your response time. That's great! I wonder how much you'll cut if you rewrite it all on assembly.