|
|
|
|
|
by crazygringo
510 days ago
|
|
> However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed. This is not true. It is generally considered an anti-pattern. The fundamental reason it is not true is because it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc. I don't know how you learned that queries should be split up like that, but it is generally horrendous from a performance standpoint. There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible. Perhaps you learned this pattern from a services architecture, where it is correct because the pieces of data all sit in different services. But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to. |
|
By who?
> it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc.
If you are talking about a straight relationship, then yes, this is true.
IE
However, when you start talking about the exact scenario I put forward it becomes slower. Primarily because the single thread/connection handling the request also has to store sometimes a non-trivial amount of data in memory while it is doing all the query processing. Especially with some conditional logic in there that makes it hard for the db to immediately return until after it's collected the entire dataset.This becomes extra true when you start dealing with wide datasets that have a large amount of reuse throughout the dataset.
If Foo has a Bar, and there are 10 million foo and 1000 Bar used throughout them, then it's faster, less network, and less data intense to load up bar separately from foo.
> I don't know how you learned that queries should be split up like that
Profiling, benchmarking, and common sense. Furthermore, if you look at how NoSQL Dbs operate, you'll notice they all came to exactly the same conclusion WRT performance.
> but it is generally horrendous from a performance standpoint.
Says you.
> There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible.
You are VASTLY overestimating the "massive overhead" of separate queries. Assuming your DB isn't in literally another country, it'll be ms at most for the round trip. For some applications dealing with small amount of data that may be a game changer, but for the apps I work on and their scenario that's peanuts compared to the actual data fetching time.
With our joining tables, we aren't sending back the full dataset. We are sending back the ids to load which correspond to the subtypes to load. You can stuff a ton of 64bit values into 1KB on the request. Further, there are ways to structure those 64bit values with some RMDBS to correspond them with the sub tables clustered index (which we do).
The only added overhead is resending the ids. Which, again, I'll point out is peanuts in a modern networking setup. I've benchmarked it, what actually takes the most time in terms of db interaction is authentication when a connection needs to be redone. Everything else pales in comparison.
> But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to.
Correct, in the exact scenario I laid out circumstances force us to. These are not narrow tables, few types, or small datasets that we are dealing with.
But even if they were, there's legitimate reasons to consider this approach. For example, when the joined table is mostly static and highly cacheable then it'd make sense splitting it from a general join to store off in a cache store. That's sort of the whole reason memcache exists as a product.