Hacker News new | ask | show | jobs
by ivraatiems 1387 days ago
I'm going to assume that by "foreign keys", you mean "foreign key constraints" where the DB itself is insisting on particular relationships.

There are a few different schools of thought. I will list them, but the important thing to remember is not to be dogmatic. They are all right or wrong depending on your circumstance.

One school of thought says "I want all data in my DB to be normalized. I want it to be right when it goes in so it never breaks the application layer." That school would say foreign key constraints are critical.

Another school of thought says "I want all the data in my DB to be retrieved and inserted quickly. I want the application layer to do any error-checking that is necessary, or, I want to be in a situation where I can always fail gracefully if there's errors in data validation."

Still another school of thought says "I don't trust those programmers to write good application code, so I will insist on normalized data for that reason," and yet another says, "I don't have control over the DB schema, that's some DBA's job, so I will just do all my validation in-app."

The point of this being, there's tradeoffs either way you go.

Personally, I typically would rather have the application layer do the validation and even the joining of data, a lot of the time, when the application is high-volume. At the volumes my organization works with, it is very difficult to write performant SQL queries that use JOINs and other relationships as a developer - even as a DBA! - and often much easier, for me, to write performant application code. The DB is also a pet with many owners, whereas the infra for my applications is owned by my team. So, it's better for me to do relationship validation in code myself. (We also do not use a heavy ORM, again for performance reasons. Just Dapper.)

At my previous job, the situation was the opposite - we weren't under such load at most times that it mattered if the queries were performant, we had Entity Framework building relationship, and EF will blow up if you ask it to build relationships where none exist. So, we needed more normalized data, and that was what we went for. But even then, not in every situation.

3 comments

> At the volumes my organization works with, it is very difficult to write performant SQL queries that use JOINs and other relationships as a developer - even as a DBA! - and often much easier, for me, to write performant application code.

How can this possibly be true? Won't that result in sending unnecessary data over the wire, stressing network and SQL buffer?

What are these queries and what are these volumes? I just can't wrap my head around the performance statement. You know better join algorithms that SQL Server is capable of (Loop/Hash/Merge)? Given that you have appropriate indexes in place, perhaps query hints is what you need to control sql plan guides if you know a thing or two about your data and it's distribution more than the sql statistics.

Joins are severely misunderstood and often incorrectly used. I've seen code that had the wrong join and would return 1,000s of rows which then had to be programatically squished down into the data we actually wanted. Some hand crafted SQL usually fixes this

Also some ORMs write dreadful SQL where it comes to joins

A badly written join (or collection of joins) will take a longer time to run that will, when the system is under load, backlog other queries. If these errant queries make up a significant portion of your queries then it will hit performance significantly

It's not the joins themselves just the incorrect use of them

You said what I'm trying to say way more clearly than I did.

A lot of people have the same concern but I'm just gonna reply to this comment.

The ratio of SQL-focused devs to non-SQL-focused devs at my org is not favorable. And we certainly DO write joins... just not complex ones. Likewise, we do use constraints... just not all the time against multi-billion-record tables.

But that's not all. Our biggest tables are also our oldest and most unwieldy. Here is an (admittedly outdated in the specifics) example of what it's like to add constraints to a big table in SQL server that didn't have them already: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/3eb...

So for a dev to come along and add a whole bunch of FK relationships and/or write some big fun queries against one of these tables is asking for a lot more than one realizes immediately. New devs join and run up against this all the time.

Is it a good or great situation? No. But that's not the question. The question is is it a real one and why.

Joins may be severely misunderstood by some people that write front end, middleware and backend code in the same day, but not by dedicated SQL developers. If the app is big enough and important enough, having dedicated SQL devs is the solution.
I'm not GP so I don't know what they meant, but a key upside to complex logic in the application layer vs the database is that the application layer is often much easier to scale out than the db. Where I work, if I run out of memory in the app I just change a configuration variable and k8s gives me more instances instantly. But if the database is memory constrained and I'm already on the biggest server available to me...I need to re-shard or take some other more sophisticated approach (tuning, replication, other stuff I don't know about).

Of course each scenario is different, YMMV, and as always "it depends".

> key upside to complex logic in the application layer vs the database is that the application layer is often much easier to scale out than the db

I think the point of the GP is that all of these application instances are still connected to the DB, doing sub optimal data fetches taxing the database in multiples.

One example: MySQL table locks. The application can do in parallel what the database can't.
Locks ensure you get committed data back and not some data that is in-operation and may or may not end being persisted. If you don't care about it: SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;

And SQL can do parallel operations too.

You made the parent commenter's point for them. You went over the heads of half the developers with join algorithms and index hints. That's just how it is, unless you're at a company with a very high bar for hiring and training.
But, besides index hints, the developers don't need to worry about those things if they use the database to perform joins. The database management system chooses for them and does it pretty well (counterproductive index hints are not unheard of).

If they do it in application code, then they probably ought to learn about fancy sorting and joining algorithms.

But they should really just do it in the database (using read only replicas if the load gets high).

Using read replicas isn't always an option depending on your write throughput and consistency requirements.

The database can only do so well (and will spend a lot of CPU cycles working on your crazy query plan, because getting it wrong is more expensive, so now you effectively limit capacity regardless of how good your storage engine is).

Joins are great, tons of research went into making joins work, and lots of different join algorithms and optimizations based on data sizes, indexes, etc. But you really have to be careful, verses just denormalizing data across multiple tables/collections. Most applications are read-heavy, anyway... I generally plan for things to be successful, in which case joins don't usually work in the hot path.

> At the volumes my organization works with, it is very difficult to write performant SQL queries that use JOINs and other relationships as a developer - even as a DBA!

Leaving aside my initial snark reactions¹ as they are not really relevant.

What you say may be true if the data is not arranged in a manner conducive to efficient queries of the type you are trying to make (for instance if the DB was optimised for a different sort of output because the needs were (or were expected to be) different at design time. BUT, read performance is not relevant to foreign keys. A constraint is assessed as INSERT/UPDATE/other time to maintain referential integrity and has no effect on later reads. You can do your own linking in the application if you want, but I'm keeping my foreign keys to stop bad data getting in - they won't affect your process of getting data out either way.

Also note that a foreign key does not imply an index exists in most DBMSs⁴ so if you are expecting the constraint to help performance when referring to a table from its parent then you may be disappointed. An index will exist where the key is referring to as FKs will always refer to a primary key or unique index but the other side is not usually indexed unless you explicitly ask for it to be. I've seen a few people run into this trap, expecting an index to be there because an FK constraint is, and coming to the conclusion that JOINs are just slow because one isn't so their queries that would benefit from it are slow.

----

[1] Sorry, not a good enough person: “sounds like you need a better DBA!”²

[2] Well, a better database developer. Even the best can't get good performance from an inappropriate design. Or maybe a time machine, everyone who has worked with BDs long enough will have been stuck with bad or inappropriate design³ we have no power or time to fix…

[3] Possibly of our own making!

[4] Some DBs create one automatically, and some ORMs & other data manipulation libraries built on top do too. But it is generally not done because it is far from always necessary, so it could waste space, and you may want a compound index instead depending on other properties of the data ans desired outputs.

> Personally, I typically would rather have the application layer do the validation and even the joining of data, a lot of the time, when the application is high-volume. At the volumes my organization works with, it is very difficult to write performant SQL queries that use JOINs and other relationships as a developer - even as a DBA! - and often much easier, for me, to write performant application code.

There is basically no way it is faster for "high-volume" systems to return excess data to the application rather than doing the joins on the dB and returning the record set to the application.

Even if we were talking about multiple billions of records you'd still be better off with a completely denormalized data warehouse style table and doing filtering against indexed columns db-side before sending vast quantities of data to the application.

I think people think this way because of licensing and the specialized nature of DBAs.

We are absolutely talking about billions of records.

But when you say "return excess data to the application" I'm not sure what you mean. Not doing lots of complex JOINs doesn't mean not filtering the queries at the DB at all. Nobody is pulling back a billion records at a time.

Here's an example of what I'm talking about - read the first comment on https://www.brentozar.com/archive/2015/05/do-foreign-keys-ma..., another venue for this same debate.

> We are absolutely talking about billions of records.

> But when you say "return excess data to the application" I'm not sure what you mean. Not doing lots of complex JOINs doesn't mean not filtering the queries at the DB at all. Nobody is pulling back a billion records at a time.

The question then becomes "Are you gaining anything by not using foreign keys on the database?" What is the additional speed impact of JOINs actually costing you? How denormalized is your database already, if JOINs are costly?

If you want raw speed at billions-of-records scale, you want as flat a schema as you can get and good indexes are actually going to fit into RAM.

By that point though, you should be able to recognize your use case is not the 90% (or even 95%) case, and your specific requirements are driving doing something different. That's very different than the vague "high-volume" statement you made at first.

My experience has been in assisting clients in migrating to data warehouses and specifically during the heyday of Hive/Hadoop/Spark years ago in seeing clients mistakenly believe they were "big data" and go down the rabbit hole of trying to scale out before it was actually necessary. The problem I have with the vague notion of "high-volume" is that I saw clients with 500m records believe they fit the bill, as well as clients with as few as 20m records who thought the same. The reality is neither of them did and they wound up wasting a lot of money in pursuit of slower systems.

> Here's an example of what I'm talking about - read the first comment on [...], another venue for this same debate.

That's not an example though, that's just some vague statements about needing to profile your query and evaluate the costs for yourself, which should be rather obvious.

I'll cop to my situation being unusual, but that's what I said in response to other posts in this thread: so is everyone's. This isn't a topic with a blanket rule. I loaded my original post with caveats like "for me" to try to make that clear. Sorry if it wasn't.

Apart from your skepticism about whether my org's DB is as efficient as could be, I don't think we actually disagree, unless your argument is that people in my situation are somehow obligated to make a specific set of choices rather than what works for them.