Hacker News new | ask | show | jobs
by BeefWellington 1387 days ago
> Personally, I typically would rather have the application layer do the validation and even the joining of data, a lot of the time, when the application is high-volume. At the volumes my organization works with, it is very difficult to write performant SQL queries that use JOINs and other relationships as a developer - even as a DBA! - and often much easier, for me, to write performant application code.

There is basically no way it is faster for "high-volume" systems to return excess data to the application rather than doing the joins on the dB and returning the record set to the application.

Even if we were talking about multiple billions of records you'd still be better off with a completely denormalized data warehouse style table and doing filtering against indexed columns db-side before sending vast quantities of data to the application.

I think people think this way because of licensing and the specialized nature of DBAs.

1 comments

We are absolutely talking about billions of records.

But when you say "return excess data to the application" I'm not sure what you mean. Not doing lots of complex JOINs doesn't mean not filtering the queries at the DB at all. Nobody is pulling back a billion records at a time.

Here's an example of what I'm talking about - read the first comment on https://www.brentozar.com/archive/2015/05/do-foreign-keys-ma..., another venue for this same debate.

> We are absolutely talking about billions of records.

> But when you say "return excess data to the application" I'm not sure what you mean. Not doing lots of complex JOINs doesn't mean not filtering the queries at the DB at all. Nobody is pulling back a billion records at a time.

The question then becomes "Are you gaining anything by not using foreign keys on the database?" What is the additional speed impact of JOINs actually costing you? How denormalized is your database already, if JOINs are costly?

If you want raw speed at billions-of-records scale, you want as flat a schema as you can get and good indexes are actually going to fit into RAM.

By that point though, you should be able to recognize your use case is not the 90% (or even 95%) case, and your specific requirements are driving doing something different. That's very different than the vague "high-volume" statement you made at first.

My experience has been in assisting clients in migrating to data warehouses and specifically during the heyday of Hive/Hadoop/Spark years ago in seeing clients mistakenly believe they were "big data" and go down the rabbit hole of trying to scale out before it was actually necessary. The problem I have with the vague notion of "high-volume" is that I saw clients with 500m records believe they fit the bill, as well as clients with as few as 20m records who thought the same. The reality is neither of them did and they wound up wasting a lot of money in pursuit of slower systems.

> Here's an example of what I'm talking about - read the first comment on [...], another venue for this same debate.

That's not an example though, that's just some vague statements about needing to profile your query and evaluate the costs for yourself, which should be rather obvious.

I'll cop to my situation being unusual, but that's what I said in response to other posts in this thread: so is everyone's. This isn't a topic with a blanket rule. I loaded my original post with caveats like "for me" to try to make that clear. Sorry if it wasn't.

Apart from your skepticism about whether my org's DB is as efficient as could be, I don't think we actually disagree, unless your argument is that people in my situation are somehow obligated to make a specific set of choices rather than what works for them.