Hacker News new | ask | show | jobs
by bottled_poe 1942 days ago
> A simple example is, say, foreign keys. Trying to access the foreign key of an object by doing `book.user.id` does an additional query for the user table to get the ID. It's less known that the id is immediately available by just doing `book.user_id` instead.

Hmm.. Sounds like a bug. Why is this not the same value for a foreign key?

4 comments

Well, it is the same value, but the ORM doesn't handle book.user.id any differently than it does book.user.name where it isn't the same value, and thus the only option is to fetch the second table. So it's not a bug, it's really just the ORM being consistent in how it handles queries, thus missing out on a possible optimization in this special case, where a simpler query could have given the same result.
No, that is clearly a bug. The ORM already has the value of book.id, that's how it knows how to fetch the right book. Performing extra queries is just poor implementation.
The programmer also already had the value in book.user_id but still chose to ask the ORM to fetch all of .user so they could get .id from there instead. And they might then afterward call .name on it as well, and there would be no further queries, because the ORM has already been asked to fetch all fields of .user - so it might in fact have been sensible to fetch all of .user if so. The query builder cannot know whether all of it will be needed or not, because Python is not a compiled language, so there's no way to tell in advance when executing the book.user.id query that no further fields of .user will be needed, so it shouldn't actually do what it's been asked to do, to fetch the entire object, but rather only fetch .id which is available in a different way, so the whole query can be skipped. So yes, this is suboptimal usage, but only the programmer can know that, so it falls to them to optimize if they want to.
Perhaps i'm a bit odd, but when I'm going to lean on an ORM to do things I expect it to actually do them. I expect that foo.user_id does not exist, because that representation has been transformed into an object. foo.user.id should be the only viable reference to the id. foo.user.id should return the value it already knows, any other property access i would expect will do the equiv of `select * from ...` if the object has not previously been populated.

Now perhaps some ORM's prefer to be thinner, to provide more footguns via a leaky abstraction that mixes implementation details with the object mapping. I don't think those are good implementaations.

> Now perhaps some ORM's prefer to be thinner, to provide more footguns via a leaky abstraction that mixes implementation details with the object mapping. I don't think those are good implementaations.

To me, it seems like the (hypothetical?) implementation you're talking about is much more leaky and footgun-y than the more straightforward ("thinner", in your words) version. In order for foo.user.id to not execute a new query, foo.user would have to return some sort of proxy object that only fetched the user row when you tried to access a field that hasn't been loaded. That's way more magic than the more obvious solution—which is to load the row when you access the related object—and could easily cause more problems than it solves in the long run when you need to debug very specific queries.

Furthermore, how is going out of your way to hide a field that exists in the database (user_id) not the definition of a leaky abstraction? What purpose does it serve to direct you through an unnecessary layer if all you need is the ID?

> I expect that foo.user_id does not exist, because that representation has been transformed into an object.

Or you can just consider user_id to be a reference pointer that is part of foo, while user.id is an attribute of user. Totally different things and I am glad that the distinction is there.

I’m sure there are valid engineering reasons to do it this way. One that comes to mind is memory footprint in allocating the objects associate with foreign key references.
Wow! Who would have thought that using an ORM would create so much uncertainty in a codebase.
Fancier ORMs will return a proxy for foo.user that doesn’t issue a query until the user asks for a property on it. And it can return user.id without querying.
But it doesn't have book.user itself. Unless you want Django to construct some empty proxy object representing book.user for this one particular optimization.

If it's a bug, sounds like a "wontfix" to me.

It's not a bug, it is the same value. Only instead of using `first_table.foreign_id` to fetch the entire record from `second_table` only to use `second_table.id`, if you only need the identifier itself you already have it in `first_table`.

A similar concept called covered queries exists whereby you index a table by foreign key, and a few additional columns that you do not expect to use in join conditions, but you do expect to frequently retrieve. Depending on your database, requests for only columns in the index (some being the join condition, some subsequent columns being in the set of popular additional columns) means faster access to those popular columns only. In the context of ORMs, you would need to do something to avoid a default behavior of "select *" in order to exploit this index.

> It's not a bug, it is the same value.

Not necessarily, it can be overridden: "id" is only the default for models that haven't explicitly been given a field with the "primary_key" kwarg (common on legacy tables where the primary key column might for example be "user_id").

The alias guaranteed to be the same value is ".pk", and I'm not sure what django does if you try to create a column named "pk" that isn't the primary key.

If your tables don't use id, or are doing something strange, YOU wrote that code. You should know it's going to be a thing you need to deal with, because YOU did something unusual.
Note that I gave the example of "legacy" tables. On long-lasting codebases there's a high likelihood that "you" didn't write that code and aren't aware of everything Django's doing under the hood.
Because when you are acessing .user you are asking for all its properties, not just the id.

Django does provide relatively easy ways to get over the N+1 issue, though. If you do Book.objects.select_related('user'), only one query is made.

`QuerySet.select_related()` and `QuerySet.prefetch_related()` are the bread and butter of Django query optimisation. I think most of the time that I've noticed a performance issue in our code, it's been easily fixed with one of those.
Django's ORM gets a lot of flak, but I don't remember the last time I had complex queries that I could not do with it.

You still need to understand a minimum of SQL and databases, and usually those that complain about the ORM are the ones that expect it to be a "sufficiently advanced compiler", but it has matured so much that nowadays the developers consider a *bug* every time the answer to How do I do this query X? involves something along the lines of use .extra or raw sql.

This is true, though to be fair to the critics, the syntax through which you express these complex queries is often clunky and unintuitive. For example, I need to re-read the documentation every time I use the annotation API because it's generally not obvious how to use it, and I've run into a few edge cases where you need extra code/syntax just to deal with its nuances and ambiguities.

Even though Django has come a long way, I greatly prefer ORMs like SQLAlchemy and Ecto that map more closely to the SQL query I'm trying to write.

It would return the same value, but the approach to obtain it would be different.

`user` would be a property defined as a User object on the Book model, so accessing `book.user` will cause the framework to fetch the entire user model (even if we then only fetch the id).

On the other hand, `book.user_id` is the auto-generated database column, generated to make the above property definition possible. But since this `user_id` is directly defined on the book object, there is no need to query the user table.