| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dagss 510 days ago

I agree that for a) inserts and b) single entity access -- in both these cases the backend can do a lot of the preparation. And your example is both a) and b). We are then in O(1) optimization territory.

If you are only processing a single entity -- the backend should tell the DB exactly what to do. And one shouldn't have if-statements in SQL of course that is "doing it wrong".

But if you have a chunk of 10000 entities like that in your example, all of different types, then you will have to insert some subset of data into all those tables (1000 in one tables, 500 another table, and so on). That logic is well suited for where conditions without much overhead.

But yes for inserts most of the logic can usually be shifted to the DB client as that is where the data resides already. The problem I was talking about was meaningfully shifting for to the client for queries, where the client has no data to work with and must fetch it from the DB.

Let us take your example and turn it into "fetch 10000 such objects". Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.

1 comments

cogman10 510 days ago

> Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.

Nope, not if done correctly.

Now, this isn't to say there's not valid reasons to do it all at once in the DB, the chief among them being ACID requirements. However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed.

Assuming you have a connection pool, the overhead of doing multiple requests in parallel is small. The DB has less data to lookup. The DB has less temporary memory to store (which in our case was a problem). The response io is smaller (not a bunch of empty columns sent back) and both the DB and the downstream application are capable of querying against these tables in parallel.

There is a latency downside in needing the load up the parent table first, if the datasize is large enough then you could overcome that problem by making batch requests to the DB as the parent dataset comes back. Say every 1k values of a given type start the parallel request to load that data.

Splitting the request into these smaller and parallel requests also has systemic benefits to the DB, new writers are able to sneak in which isn't possible when you try to do everything at one go (another issue we had).

The added benefit here is the optimizer in the DB is more likely to do the right thing for the subtable requests than it is for the temp table mess request. A simple fetch is far easier to optimize than a complex one.

link

crazygringo 509 days ago

> However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed.

This is not true. It is generally considered an anti-pattern.

The fundamental reason it is not true is because it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc.

I don't know how you learned that queries should be split up like that, but it is generally horrendous from a performance standpoint. There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible.

Perhaps you learned this pattern from a services architecture, where it is correct because the pieces of data all sit in different services. But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to.

link

cogman10 509 days ago

> This is not true. It is generally considered an anti-pattern.

By who?

> it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc.

If you are talking about a straight relationship, then yes, this is true.

    select a from foo join bar on a=b join baz on b=c

However, when you start talking about the exact scenario I put forward it becomes slower. Primarily because the single thread/connection handling the request also has to store sometimes a non-trivial amount of data in memory while it is doing all the query processing. Especially with some conditional logic in there that makes it hard for the db to immediately return until after it's collected the entire dataset.

This becomes extra true when you start dealing with wide datasets that have a large amount of reuse throughout the dataset.

If Foo has a Bar, and there are 10 million foo and 1000 Bar used throughout them, then it's faster, less network, and less data intense to load up bar separately from foo.

> I don't know how you learned that queries should be split up like that

Profiling, benchmarking, and common sense. Furthermore, if you look at how NoSQL Dbs operate, you'll notice they all came to exactly the same conclusion WRT performance.

> but it is generally horrendous from a performance standpoint.

Says you.

> There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible.

You are VASTLY overestimating the "massive overhead" of separate queries. Assuming your DB isn't in literally another country, it'll be ms at most for the round trip. For some applications dealing with small amount of data that may be a game changer, but for the apps I work on and their scenario that's peanuts compared to the actual data fetching time.

With our joining tables, we aren't sending back the full dataset. We are sending back the ids to load which correspond to the subtypes to load. You can stuff a ton of 64bit values into 1KB on the request. Further, there are ways to structure those 64bit values with some RMDBS to correspond them with the sub tables clustered index (which we do).

The only added overhead is resending the ids. Which, again, I'll point out is peanuts in a modern networking setup. I've benchmarked it, what actually takes the most time in terms of db interaction is authentication when a connection needs to be redone. Everything else pales in comparison.

> But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to.

Correct, in the exact scenario I laid out circumstances force us to. These are not narrow tables, few types, or small datasets that we are dealing with.

But even if they were, there's legitimate reasons to consider this approach. For example, when the joined table is mostly static and highly cacheable then it'd make sense splitting it from a general join to store off in a cache store. That's sort of the whole reason memcache exists as a product.

link

crazygringo 509 days ago

You seem to be taking some kind of extreme example of a particular use case of yours and trying to make it a general rule. But it's not.

And none of this is "says me", it's standard practice, it's relational databases 101. And none of this is about NoSQL, it's about relational databases. NoSQL performance can be abysmal for trying to do things relational databases are meant for.

And the overhead is not about network latency, it's about all of the overhead involved in serializing, transferring, deserializing, and then doing it all over again in the other direction.

Your comment seems to boil down to:

> If Foo has a Bar, and there are 10 million foo and 1000 Bar used throughout them, then it's faster, less network, and less data intense to load up bar separately from foo.

I assume you're not retrieving 10 million Foo for the user, god forbid -- you're retrieving 20 or 50 or something user-friendly. Then you should join to Bar. It is slower and more overhead to load up Bar separate from Foo. It is an anti-pattern.

If you are getting results to the contrary, your query may not be written correctly -- e.g. you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause, and then only applying the WHERE at a higher level (in which case one solution is to move the WHERE clause into the subquery). Or your tables may not be architected suitably for the queries you need to perform, and you need to revisit your normalization strategy.

Again, there are super-super-complex queries where yes it becomes necessary to split them up. But that is not the "rule", it is not the starting point -- it is what you do only when you've exhausted all possible options of keeping it in the query. It is never a recommendation of how to use databases in a general sense, which is what you are suggesting.

link

cogman10 509 days ago

> use case of yours and trying to make it a general rule

This is a fair critique. Definitely our system is a bit unique in what it works with and the amount of random data it needs to pull together.

> it's about all of the overhead involved in serializing, transferring, deserializing, and all the way back.

Serializing and deserializing are typically not a huge cost in DB communications. Most DB protocols have binary data transfer protocols which minimize the amount of effort on server or client side needed to transform the data into native language datatypes. It's not going to be a Json protocol.

Transfer can be a problem, though, if the dataset is large.

> I assume you're not retrieving 10 million Foo for the user, god forbid

In our most extreme cases, yeah we are actually pulling 10 million foo. Though a lot of our ETL backend is where these big data requests are happening as the upstream data is being processed. That's primarily where I end up working rather than the frontend service.

And I'll agree with you. If you are talking about requests which result in the order of 10 to 100 items then yes, it's faster to do that all within the database. It depends (which is what I've been trying to communicate throughout this thread).

> you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause

No, properly formed SQL. The issue is the mass of data being transferred and, as I mentioned earlier, the temporary memory being stored in the DB while it waits to transfer everything to the application.

Splitting things into the smaller and multiple queries ends up being faster for us because the DB doesn't end up storing as much temp data, nor does it end up serializing a bunch of `null` values which ultimately take up a significant chunk of the transfer.

Also, you should recognize that now you are talking about query structure that it's not universal on what's the best/fastest way to structure a query. What's good for postgresql might be bad for mssql.

link