Hacker News new | ask | show | jobs
by mtts 1387 days ago
Fear of RDBMSes is quite common. I used to suffer from it too. It’s just so annoying to have to switch your brain to a different programming paradigm every time you need to do something with the database that you start to make up all sorts of excuses as to why it’s really just better to “do it in the code”. Your coworkers argument about FKs making data migrations difficult is one of them.

Another classic is the “joins are slow” argument, which I believe goes back to a period in the late 1990s when in one, not highly regarded at the time, database, namely MySQL, they were indeed slow. But the reason “everyone” knew about this was precisely the oddness of this situation: in fact RDBMSes are highly optimized pieces of software that are especially good at combining sets of data. Much better than ORMs, anyway, or, god forbid, whatever you cobble together on your own.

There is, in my mind, only one valid reason to not use foreign keys in a database schema. If your database is mostly write only, the additional overhead of generating the indexes for the foreign keys may slow you down a little (for reading, these very same foreign keys in fact speed things up quite considerably). Even in such a case, however, I’d argue you’re doing it wrong and there should be a cache of some sort before things are written out in bulk to a properly setup RDBMS.

11 comments

> Much better than ORMs

I recently migrated to EntityFramework Core (from the non-core version) and I’m actually impressed. Most SQL is pretty much what I’d write by hand.

Now granted, if there are complex joins, subqueries and stuff, I don’t even try wrangling the ORM to somehow give me that output, but still. I feel more comfortable just using EF than I used to.

My main problem with Entity Framework is the magic underneath.

Like simple operation

    x = Ef.Find(xid)
    x.Name = "something"
    y = Ef.Find(xid)

what is y.Name ? Even though you didn't save anything to the database yet ? And the second Find didn't actually refresh from the database ?

Oh and the random bugs where people improperly include related entities but it somehow ends up working because they are automatically added as you're firing off other related queries, until eventually it does not (usually in production only).

It's a really really complex system designed to look simple and pave over important details with "works most of the time" defaults.

Once you move beyond trivial cases you really need to spend time understanding the principles behind the ORM you're using. They are always a very leaky abstraction, there is not really a way around that.

In this case the important part to know is that the DbContext represents the unit of work and "knows" Entities you previously queried on it. That's very useful, but also can hide bugs like you mentioned with the Includes. I do wish that you'd get more obvious errors if you forget an include, this can be really annoying to debug especially if you're new to EF Core. For read queries I mostly use Select instead of Include, which I find easier and more straightforward in most cases.

ORMs are really useful for making very common operations easy and for making stuff composable. They're also very complex and to make the best use of them you do need to understand both SQL and some basics on how your specific ORM generates this SQL.

>They're also very complex and to make the best use of them you do need to understand both SQL and some basics on how your specific ORM generates this SQL.

I think the biggest pitfall is how it maps object model to SQL.

The thing people fear about SQL query generation - IMO it's a non issue - when you identify hotspots you write your query manually, tools for that are there, it's easy to do retroactively and >90% of the code won't be the critical path.

DbContext is basically shared mutable state between your entire execution scope, and worst of all it makes it non-obvious.

> Once you move beyond trivial cases you really need to spend time understanding the principles behind the ORM you're using. They are always a very leaky abstraction, there is not really a way around that.

This is why I avoid ORMs in favor of writing SQL queries manually: I only need to understand one complex system for non-trivial cases instead of two.

(To be fair, I haven’t done any database programming for a few years. ORMs may have significantly improved since I last looked at them.)

They are always a very leaky abstraction, there is not really a way around that.

Editing in general is hard. E.g. if in a form you change a field that participates in some filter which generates a dataset to be used in that form, it creates an issue that a naive join now returns incorrect data (because the join condition itself was edited). Complex ORMs which help with that^ are not leaky abstractions, they just try to avoid mistakes an average programmer would do anyway in “trivial” SQL tasks without blinking once.

And yes, gp question about x vs y means that no thought of editing contexts was ever considered. Plain old fetch-store is too low-level and doesn’t represent a model that business logic thinks in.

^ Idk about EF in particular, just assuming

This is why I like TypeORM. There is no magic, and every command maps 1:1 with a database operation.

No weird caching, no auto saves. Just an object mapper that you can use when you want and ignore when you need to.

This example is incomplete. We need to see what the enclosing transaction scope looks like.
That's sort of my point - when you see a random dbcontext read inside a function you have no idea what the fetch will actually do. It might just return an object that was already fetched elsewhere in the context and modified but not saved. It might return the first value. It will automatically plug related entities into navigation collections - even if they are queried completely independently.
I concur with these. I see colleagues who use EF as 'black magic', who postpone or fear looking into what is happening under the hood. Because they lack insight into what it really does, they regularly cause horrendous queries to happen. My pet peeve with EF LINQ is, that your c# is not really c#, so you may write queries that compile silently, but fail to execute on runtime because the C# cannot be translated to SQL.
Funnily, one of my pet peeves is people worrying about the SQL Generated from EF Linq. If you care that much about it I think you should just be writing the SQL by hand.
I might even go a step farther - you shouldn't care. It's the equivalent of caring whether or not your generated HTML or compiled IL or Assembly "looks nice."

If the SQL is performant and it returns the expected data, that is good enough for 99.9% of cases.

There are cases where you have to care, e.g. the difference between AsSingleQuery() and AsSplitQuery() in EF Core. This option only affects what kind of queries EF Core will create to perform the same job, but it can have pretty significant implications on performance in some cases, and it can affect the consistency guarantees you get for your results.
The abstraction is great, until it breaks.

In this case, with ORMS, even good ones, this happens often enough in production that to actually master the tool you do need to care.

Another vote for EF Core here. It’s superb.
ORM is a very valuable tool and should be aggressively used. One can always step down to SQL as needed but otherwise, the ORM logic is easier to write and maintain.
the all-or-nothing approach is prevalent in both camps. i’ve worked in places where a straightforward optimization could not be implemented because it would require the developers to break from the orm-only standard they’d set.

i’ve also worked places where orm were held as such anathema that any orm proposal was dismissed out of hand without any sort of discussion.

As is almost always the case, the middle ground is better. In a few of the larger .NET projects I've worked on we would use EF until it became too much of a pain (or the business rules made the query unmanageable) and then someone would eventually spend a few days transitioning the query into a stored procedure + SQL functions, updating tests, etc. The biggest complaint I ever got from that was that it was impossible to tell what was going to be LINQ and what was going to be a stored procedure. While not an insignificant concern, it certainly beats "oh we can't optimize that 20-second long query because then we'd have to drop below the ORM and 'we don't do that here'" or "no you can't use any ORM even though it objectively eliminates a lot of boilerplate work."
The all-or-nothing approach is what makes people hate ORMs and go anti-ORM.
ORM logic is not easier to write and maintain unless you're bad at SQL and writing your queries without any tool support.
Strongly agreed.

In my experience all using an ORM accomplishes is making sure the people on your team who are amazing with SQL write just as bad queries as those who suck at SQL.

Yep entity framework is truly amazing. If you have used that ORM you never go back. You still need to sometimes make your own query for perf or other needs. But it's quite rare in my experience.

Most of the time when I had performce issues it isn't EF. It's a missed index or higher level query issue.

> Another classic is the “joins are slow” argument

The only person I knew who died on that hill would insist on doing two queries to the database, and then would insist on doing a client side cartesian join.

I remember getting beers with somebody in the aughts who claimed that he saw an entire website where the url was the key and the webpage was the value in an Oracle database. Any code was SQL operations inside the value field.
I once had a coworker who dreamed of that exact setup.
Isn’t that effectively what a CMS is?
That's amazing!
Are joins in a 5NF database now as fast as querying a denormalized database?
Maybe the joins are faster? It's really hard to tell without more context/detail.

I think many were burned by mysql back in the day - trying to use sql as a document database - or using php frameworks that happily did a hundred queries pr page view.

As a general rule of thumb, for a REST app - I'd say the db should be normalized, and the cache layer(s) can handle the denormalization.

Ie when you get /page=1 varnish can spit out a response from ram (which if you squint, is a denormalized projection of your data), or it can go talk to your app, that talks to the db. And the latter is most likely fast enough (tm).

Maybe I'm missing some context, but isn't that true by definition even if the db does nothing special? You either spend time sending N queries and waiting for responses, or join and use one query. Given actually matching scenarios for both, the one with less communication overhead wins.
In a normalized database that's true, but in a denormalized database, by definition, you get a third option, which is to have tables with redundant data that can be returned in a single query (as if it were pre-joined, I suppose).
Depends. Denormalized means the database contains redundant data. If a query have to scan 10x or 100x as many rows due to redundant data, it is obviously going to be slower. But it is hard to say anything general since denormalization will make some queries faster and other queries slower.
with good index you will not scan more rows.

But each query will use a different copy of the same data instead of joining with the same copy.

Storing both copy in memory take more space so you can’t cache as much in memory.

I’m not talking redis or memcached but the page cache inside the sql engine.

they always been faster! When you have 5NF, the database is smaller and all the row you join will be in memory in the SQL server PageCache.

While when using denormalized database, your read will have to go to the disk.

Seq scans will be faster in a normal form database, if you're seq scanning then joining other tables on an index it might be faster. Otherwise the denormalized table will probably be faster.
To be fair thata a reasonable approach if the database is at its monolithic scaling limit in CPU but not IO, while the clients can scale horizontally to more machines.

Unlikely in practice, though.

oh so your talking about using the database as a file system and moving all the query logic in the client
We used to do large setups at companies for what was then called intra and extranets begin 00s. These were very read/write intensive as the staff and partner staff would be on there basically all the time during office hours and data was not great for caching as data changed a lot especially in some companies like large hospitals and universities. We used mysql (I cannot remember why) and we did a lot of performance testing at that time; we removed all joins which made everything a lot faster. This is no longer the case now but indeed many people still believe it ; not (only) because they saw or tried it back then, but also because it’s less strain on the brain to just do single table selects and use not FKs or joins.
Back in the day I was forced to ditch FKs in my MySQL application, because I needed a FULLTEXT index on one of my columns, and MySQL only supported that type of index on MyISAM tables (this was on 5.x or something). MyISAM didn't do foreign keys.

It was a pretty central table, and the inability to use FKs there kinda spread outward.

Did you consider making a 1-1 relationship on a new table that only had the FULLTEXT column? Curious how you evaluated the trade offs
I can't remember how much time I spent thinking about it, but if I were to reenact my state of mind at the time, I probably concluded something like, "Without transactions, I'll have to write more code to make sure the ID in both tables stays in sync, and I'll have to send 2 separate INSERTs (sequentially) for every record added, and if the first one fails, I need to handle that, and if the 2nd one fails, I need to handle that differently, and... fuck it. I'll just promise to be good and not use FKs"

Or something. I can't remember the details, but I was (and still am) very averse to complexity in my application code.

> Another classic is the “joins are slow” argument

Along with the "indexes slow down INSERTs and UPDATEs" argument that you touch on. I mean, it is literally true that indexes make writes slightly slower, and an excessive quantity of indexes (which I have seen) can slow down writes enough to cause problems. But - in general - the slowdown is irrelevant compared with the overhead of querying a table that contains 2 billion rows using, oh, I don't know, a table scan because you don't have even a single index (I have also seen this).

One reason to avoid FK is when your database is partitioned to multiple servers, but that's obvious, I guess, and it's not really RDBMS anymore.
> RDBMSes are highly optimized pieces of software

> Much better than ORMs

These two things are not mutually exclusive though right?

It’s entirely possible to have a lightweight and relatively transparent ORM which makes full use of the underlying RDBMS.

Yeah, I was going to say something similar. But ORMs get blamed for obscuring what's going on, to the point that a developer may end up doing some sort of inefficient 1-to-n lookup that would've indeed been much better off as a SQL JOIN.

I use JPA/Hibernate professionally, as a decision maker, but I don't think I'm in either camp entirely. ORMs aren't a magic wand, but they do help you standardize the boilerplate that you'd end up with one way or the other, in most cases.

I definitely see that, and ORMs (particularly older ones) have historically made it easy to shoot yourself in the foot.

But, everything is an abstraction, and I tend to think that if you use any abstraction, you need to have at least a little bit of knowledge about what’s happening in the layer beneath it.

So using an ORM will not be an optimal experience if you don’t know how the underlying RDBMS works.

And effectively using an RDBMS directly still requires a bit of knowledge about the layer below that level of abstraction too (eg how underlying query optimisation works etc).

It’s possible to implement both incorrectly and get bad results and the opposite is true too

Agreed, and there's a lot you can gain from an ORM/query builder just in terms of ergonomics or niceness for the 80% use-case.

Doing intensive string manipulation to put your query together becomes painful, fast, especially when you're dealing with optional parts like ordering, limiting, filtering, pagination, etc. It's also incredibly easy to slip in an injection vulnerability as you do that (especially if you're new to programming).

Just don't use it as a crutch because the declarative nature of SQL is vastly more powerful than an imperative wrapper and you'll be at a loss for only knowing the conventions and opinions of your ORM of choice.

> in fact RDBMSes are highly optimized pieces of software that are especially good at combining sets of data. Much better than ORMs, anyway,

ORMs are just a wrapper around RDBMSes. If your ORM is producing incredibly stupid SQL to query the DB with, you might want to check that you're not modelling your data in a stupid way.

I am by no means an expert, but in general I have found that if the ORM is doing something particularly crazy, it's because my underlying assumptions about the data model is wrong.

> Your coworkers argument about FKs making data migrations difficult is one of them.

Got any arguments to back up this bald assertion?

In particular, I'd love to hear more about how to manage schema migrations on large tables with FK's without incurring lengthy locks or downtime.

Betting the answer is going to involve some variation on "well, don't do that" which is when I'll rest my case.

There are tools for live migrations for most popular databases. Also a lot of Postgres DDL is very fast and/or capable of happening live.
A lot of it depends on the use case. For example, Facebook - one of the largest (if not the largest) deployments of mysql does not allow any FK constrains. There’s multiple reasons, but one of those is better predictability of db operational perf - a row delete should delete just the row and not potentially trigger N cascading deletes.
I don't understand “a row delete should delete just the row and not potentially trigger N cascading deletes”. If you want that to not happen, then define that in the database definition. It sounds like you're saying that a core piece of functionality is somehow ‘wrong’, even though that same functionality can be used to make the desired bahviour for this exact use case explicit?
facebook data model is a Graph where each row store one object “comment” or one association “comment is with post id” between objects .

They made an query and indexing system on top of it to make it fast called TAO.

Without it you need to send a distinct SQL query pet parent object to get list of associated child object which would be awfuly slow.

Non-tao use cases of mysql at FB also cannot use FK constraints (or ‘triggers’).
by "no FK contraints" do you mean "no join using index" or you simply mean foreign key violation is not checked.
Cascading deletes is a separate from FK constraints. You can have FK constraints without cascading deletes.
How about when the ID in a FK column has been generated outside the RDBMS but the target of the ID has not been written yet?
You can use DEFERRABLE INITIALLY DEFERRED constraints so that the check happens when the transaction is committed.
I assume the target is externally generated too, thus can be legitimately absent.
I had this when importing test data; I found it acceptable (since it was just in development) to temporarily turn off FK checking.