Hacker News new | ask | show | jobs
by e12e 2459 days ago
Somewhat on the side, but:

> By using the includes method we've been able to knock our queries down from six to two: The first to find the events, and the second to find the categories for those events.

Am I the only one that thinks rails/activerecord is crazy here for (often) preferring two trips to the db over a join on indexed primary keys?

Are there really (sane) use-cases where that is better?

5 comments

>Am I the only one that thinks rails/activerecord is crazy here for (often) preferring two trips to the db over a join on indexed primary keys?

You mean

  SELECT u.* , c.* FROM users u INNER JOIN comments c
    ON u.id = c.user_id WHERE u.id = <...>
?

If the average user record includes a non-trivial amount of data and there are many comments, you're repeating a lot of information in the query results.

disclaimer: I'm currently doing some work for pganalyze.com (the blog host)

Yes.

I'm not sure about "non-trivial amount of data" - especially if we're talking about a single row (single user) here? I suppose there might be a few blolbs of large json documents?

And I'm not sure I understand "repeating information in query results" - surely we're talking about receiving a query, transforming it and responding with the data?

In a rails+graphql ideal world, that'd maybe mean Parsing the json query, generate some simple ruby (active record) code, fire it off to the db (let ar /arel do its job; generate sql) - serialize to json and write the response?

If there truly is non-trivial data in fields not requested, it might be worth it to naemrrow the columns selected. But I doubt it, in the general case.

>I'm not sure about "non-trivial amount of data" - especially if we're talking about a single row (single user) here? I suppose there might be a few blolbs of large json documents? > >And I'm not sure I understand "repeating information in query results" - surely we're talking about receiving a query, transforming it and responding with the data?

For a query like

  SELECT u.* , c.* FROM users u INNER JOIN comments c
    ON u.id = c.user_id WHERE u.id = <...>
above, the user information is repeated for each comment, e.g.,

  u.id, u.name, c.id, c.text
  1     Bob     1     you are right
  1     Bob     2     you are wrong
  1     Bob     3     you are right again
If you have a hundred comments, and many user fields, this can be a significant amount of data to pass around. Or maybe this is not what you meant? I feel like we may be talking past each other. Can you give an example of what you think the ideal sql should be for "give me user with id = 1 and all their comments"?
No, you're probably right - I suppose the database drivers might naively serialize that in a one-to-one representation (to be honest, I've never really looked into the protocols that postgres or Ms sql use for transferring data).

I didn't consider the kind of expansion you're alluding to - on the other hand i can't ever remember seeing an indication that the amount of data returned is an issue with app performance.

But I've certainly seen the number of round-trips result in real issues. And complicated joins, for that matter.

I've worked on a few Postgres drivers and the protocol is basically sending data row-by-row, column-by-column. Sure, it could be more clever, but that's time and cpu and memory spent being clever instead of moving data.

But you're right, there are always trade-offs, and there are some situations where the current AR approach is not ideal, and a single join would be more efficient. However, I suspect this is the way AR does it because it's right enough often enough that it works well as the default approach.

Is the concern over repeated information about [load on the DB], [size of data sent to the client], or something like [processing requirements on the client]? I'm assuming you're talking about [size of data sent to the client], but shouldn't that data be readily-compressible?
I mostly meant data size on the wire, yeah, but these are all potential issues. In any case, the performance difference between the two approaches is going to heavily depend on your schema and your data, and I suspect the AR works well enough in practice.

And in this example, the data is definitely compressible, but the Postgres wire protocol does not implement compression (it leaves that to tls right now, though CRIME makes that a problem). Adding native compression to the protocol has been on the TODO list for a while: https://wiki.postgresql.org/wiki/Todo#Wire_Protocol_Changes_...

Joins don't scale to more than one relation.

Modify the original sample to load two relations: fetch all events with their first 5 categories and all their users. If you use two joins then you get a combinatorial explosion because you are now getting all permutations of the (event, category, user) tuple.

  (event1, category1, user1)
  (event1, category1, user2)
  (event1, category2, user1)
  (event1, category2, user2)
It's more composable with multiple includes.

For instance, if you have a simple active record type:

class User has_many :comments; has_many :likes; end

User.includes(:comments, :likes).find(user_id)

Generates 3 queries, but importantly the number of records returned is appropriate.

User.joins(:comments, :likes).where(users: { id: user_id })

Generates only a single query, but the number of records returned from this join is the product of comments*likes for this particular user.

I find the includes stuff easier to reason about as the amount of associations you need loaded in memory becomes more complex.

Hm, I see now where we're talking past each other - I was thinking more along the lines of "user has orders has order lines has products has prices" (where you might traditionally make a (hierarchy of) database view(s)).

Not "user has orders", and also "user has favorite products".

So the problem is active record is no good at joins?

(That is, it's hard/impossible via the active record DSL to make the correct nested join?)

By the way (you may know this) when staying within active record convention, you should never need to single out the ".id" and ".foreign_model_id" fields:

User.joins(:comments, :likes).where(user: some_user)

(where some_user is an instance of User).

I'm still not quite comfortable enough with AR, but I think, leaning on this: https://www.learneroo.com/modules/137/nodes/768

I think what you want in your second case is simply?:

User.joins(:comments, :likes).where(user: some_user).uniq

(I'd have to create some models and watch what ".to_sql" creates to be certain, though)

> So the problem is active record is no good at joins?

You encounter this problem whether you use an ORM or not.

To play devil's advocate you could argue many web apps are small internal tools that provide lots of business value but only have small (even single digit) user numbers and thus shouldn't optimize in the slightest for db trips (and instead optimize for maintenance or simplicity).
Just because there are few users doesn't mean inefficiency is fine.
It is because there is no reason why you should waste your time optimizing a page with a heavy query (with thousands of inserts) from 10 seconds to the theoretical 6 seconds that are possible if its only going to be used 20 times and the rest of the time it's simply loading the last cached result?

Also from my experience the performance problems are completely divorced from ORM/not ORM. Usually the reason why you drop the ORM is that you want to take advantage of features that are not modeled by the ORM. Most query optimizations simply involve trying to do everything in as big of a batch as possible. Whether you do batching with or without the ORM isn't really relevant. The reason why people don't batch their queries when they use ORMs is because it's so trivial to not do so but there is no reason why we can't build better ORMs that make batched queries as easy to use as non batched queries which would get us both performance and convenience at the same time.

On the other hand, it also does not mean that inefficiency to some degree is not acceptable.

And if it is not acceptable, you can still optimize the living excrements out of your GraphQL resolvers.

Also, the burden of doing more than one round trip usually moves to the client. So there's always some trade-off to be made.

There are different types of inefficiency and being inefficient in one area might be worth the trade off to be efficient in another. Also, multiple dB queries really aren’t the end of the world.
If you knew ahead of time the fields that the user wanted to load, by all means do a join and load the data ahead of time. This would work great for RESTful APIs... but the issue with GraphQL is that you don't know ahead of time which fields the user will request, so lazy-loading the data provides efficiency without upfront knowledge of what the user will request.
You don’t need to know what they will request, you know what they did request. The dataloader thing is a kludge because you never know where you’ll have to go to get data for a particular part of the tree - maybe it’s another endpoint.

Having said that, I don’t mind the second request model myself. It’s actually pretty clean and when you take the dB roundtrips out of it, the dB will generally do the same work anyway.