Hacker News new | ask | show | jobs
by pocketarc 511 days ago
This is actually an incredible way of articulating something that's been on my mind for quite a while. Thank you for this, I will use this.

The received wisdom is, of course, to lean on the DB as much as possible, put all the business logic in SQL because of course the DB is much more efficient. I myself have always been a big proponent of it.

But, as you rightly point out, you're using up one of your infrastructure's most scarce and hard-to-scale resources - the DB's CPU.

5 comments

I think there are two different concerns here though:

The article recommends something that may lead to using the wrong query plans. In the "right" conditions, you will do full table scans of all your data for every query.

This is making the DB waste a lot of CPU (and IO).

Wasting resources like that is different from just where to do work that has to be done anyway!

I am a proponent of shifting logic toward the DB, because likely it ends up there anyway and usually you reduce the resource consumption also for the DB to have as much logic as possible in the DB.

The extreme example is you want to sum(numbers) -- it is so much faster to sum it in one roundtrip to the DB, than to do a thousand roundtrips to the DB to fetch the numbers to sum them on the client. The latter is so much more effort also for the DB server's resources.

My point is: Usually it is impossible to meaningfully shift CPU work to the client of the DB, because the client needs the data, so it will ask for it, and looking up the data is the most costly operation in the DB.

The answer is "it depends".

Sum is a good thing to do in the Db because it's low cost to the db and reduces io between the db and app.

Sort can be (depending on indexes) a bad thing for a db because that's CPU time that needs to be burned.

Conditional logic is also (often) terrible for the db because it can break the optimizer in weird ways and is just as easily performed outside the db.

The right action to take is whatever optimizes db resources in the long run. That can sometimes mean shifting to the db, and sometimes it means shifting out of the db.

It's hard to think of situations where you don't want to do the sorting on the DB. If you're sorting small numbers of rows it's cheap enough that it doesn't matter, and if you're sorting large numbers of rows you should be using an index which makes it vastly more efficient than it could be in your app.

And if your conditional logic is breaking the optimizer then the solution is usually to write the query more correctly. I can't think of a single instance where I've ever found moving conditional logic out of a query to be meaningfully more performant. But maybe there's a specific example you have in mind?

> if you're sorting large numbers of rows you should be using an index

Perhaps, depends on what the table is doing and needs to be optimized for.

Indexes are not free, they have a write penalty as they need to be updated every time the data in the index is updated.

> I can't think of a single instance where I've ever found moving conditional logic out of a query to be meaningfully more performant. But maybe there's a specific example you have in mind?

Certainly.

In one of our applications we effectively represent the types as subtables with a root table for the parent type. There were roughly 10 different types with different columns per type.

One way the queries were written, which is slow, was that on insertion the client app would send in (effectively) a block of data with all columns for these types to insert. In the database, the conditional logic would pull out the type id from the input and make the decision on that type information for which subtable would be inserted.

There's really no way to make this something the SQL optimizer can well consume.

The right solution was to instead break this up in the application and per type do the insertions directly into the table type in question. It simplified both sides of the code and ran faster.

I agree that for a) inserts and b) single entity access -- in both these cases the backend can do a lot of the preparation. And your example is both a) and b). We are then in O(1) optimization territory.

If you are only processing a single entity -- the backend should tell the DB exactly what to do. And one shouldn't have if-statements in SQL of course that is "doing it wrong".

But if you have a chunk of 10000 entities like that in your example, all of different types, then you will have to insert some subset of data into all those tables (1000 in one tables, 500 another table, and so on). That logic is well suited for where conditions without much overhead.

But yes for inserts most of the logic can usually be shifted to the DB client as that is where the data resides already. The problem I was talking about was meaningfully shifting for to the client for queries, where the client has no data to work with and must fetch it from the DB.

Let us take your example and turn it into "fetch 10000 such objects". Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.

> Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.

Nope, not if done correctly.

Now, this isn't to say there's not valid reasons to do it all at once in the DB, the chief among them being ACID requirements. However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed.

Assuming you have a connection pool, the overhead of doing multiple requests in parallel is small. The DB has less data to lookup. The DB has less temporary memory to store (which in our case was a problem). The response io is smaller (not a bunch of empty columns sent back) and both the DB and the downstream application are capable of querying against these tables in parallel.

There is a latency downside in needing the load up the parent table first, if the datasize is large enough then you could overcome that problem by making batch requests to the DB as the parent dataset comes back. Say every 1k values of a given type start the parallel request to load that data.

Splitting the request into these smaller and parallel requests also has systemic benefits to the DB, new writers are able to sneak in which isn't possible when you try to do everything at one go (another issue we had).

The added benefit here is the optimizer in the DB is more likely to do the right thing for the subtable requests than it is for the temp table mess request. A simple fetch is far easier to optimize than a complex one.

> The received wisdom is, of course, to lean on the DB as much as possible, put all the business logic in SQL because of course the DB is much more efficient. I myself have always been a big proponent of it.

This is not received wisdom at all and the one edict I have when leading a project is no stored procedures for any OLTP functionality.

Stored Procs make everything about your standard DevOps and SDLC process harder - branching, blue green deployments and rolling back deployments.

>Stored Procs make everything about your standard DevOps and SDLC process harder - branching, blue green deployments and rolling back deployments.

There is a naming/namespacing strategy incorporating a immutable version identifier that makes this easier, which I have described here:

https://news.ycombinator.com/item?id=35648974

Note that this requires a strategy for cleaning up old procedures.

It also is possible to individually hash each procedure, which is more sophisticated, and would allow for incremental creation of new procedures.

That’s actually an ingenious solution. I can’t find any flaws in it.
There's a different reason to lean on the DB: it's the final arbiter of your data.

Much harder to create bad/poisoned data if the DB has a constraint on it (primary, foreign, check, etc) than if you have to remember it in your application (and unless you know what serializable transactions are, you are likely doing it wrong).

Also you can't do indexes outside of the DB (well, you can try).

Replying to this whole sub-thread, not just this post specifically;

All SQL advice has to take _context_ into account. In SQL, perhaps more than anywhere else, context matters. There's lots of excellent SQL advice, but most of it is bound to a specific context, and in a different context it's bad advice.

Take for example the parent comment above; In their context the CPU of the database server is their constraining resource. I'm guessing the database if "close" to the app servers (ie low network latency, high bandwidth), and I'm also guessing the app developers "own" the database. In this context moving CPU to the app server makes complete sense. Client-side validation of data makes sense because they are the only client.

Of course if the context changes, then the advice has to change as well. If the network bandwidth to the server was constrained (cost, distance etc) then transporting the smallest amount of data becomes important. In this case it doesn't matter if the filter is more work for the server, the goal is the smallest result set.

And so it goes. Write-heavy systems prefer fewer indexes. Read-heavy systems prefer lots of indexes. Databases where the data client is untrusted need more validation, relation integrity, access control - databases with a trusted client need less of that.

In my career I've followed a lot of good SQL advice - advice that was good for my context. I've also broken a lot of SQL "rules" because those rules were not compatible, or were harmful, in my context.

So my advice is this - understand your own context. Understand where you are constrained, and where you have plenty. And tailor your patterns around those parameters.

The most sensible and pragmatic advice in this thread.
I think the conventional wisdom is to lean on the DB to maintain data integrity, using constraints, foreign keys, etc.; not to use it to run actual business logic.
This is Brent Ozar's old theme.