Hacker News new | ask | show | jobs
by jongjong 1387 days ago
I think the author is talking about 'foreign key constraints' - You could have foreign keys without enforcing a constraint.

Personally, I don't use foreign key constraints because:

1. It makes schema migrations and other data-management operations more difficult.

2. On insertion, the database needs to perform an additional check to verify that the record exists at the foreign key; this carries a performance cost; IMO, this is something which should be enforced at the application layer anyway.

3. It makes it more difficult to scale the database later because you can't separate tables onto 2 different hosts if one table references another using a foreign key.

BTW, about #3, the same argument can be made against using table joins. Once you start using foreign keys or table joins, you will be forced to run those two tables on the same host in the foreseeable future; it's very difficult, error-prone and time consuming to migrate away from such architecture if you have a lot of data in a live environment. Personally I prefer to design all my tables and front end applications to not rely on foreign keys or table joins. There is a good reason why databases which are focused on scalability (like MongoDB) do not support foreign keys or joins (or at least they try to avoid them).

I prefer to assemble data on the front end as much as possible because it allows my REST API calls to be granular; each one only refers to a single kind of resource; this helps to simplify caching and real-time updates; it also uses fewer resources on the server side and I find that it makes the front-end code more maintainable. Also, I like to design my front ends to mirror the natural separation of resources within the database. When the user wants to open up a related record, they need to click on a link (the foreign key ID/UUID is used to construct the link to the related resource); this loads up the other record as a separate step. This creates a very smooth (and fast) user experience - I also like it because this approach does not overload the user with information; collections of items don't show much details, on the other hand, individual resources may show a lot of detail.

The real reason people use joins is because they want to pack a lot of details onto the user's screen when they are looking at a list view... Sometimes the reason why they want to do that is because they didn't design their tables correctly; maybe the tables which they use to generate list views don't contain enough columns/detail to be useful on their own so they feel forced to do joins. I find that drawing ER diagrams helps a lot with that. It's very important to get the cardinality of relationships between the different tables exactly right. Also, I find it very helpful to represent any many-to-many relation between two tables as a distinct table.

2 comments

On point 3 it should be noted that it's almost always a mistake to optimize for scale at the start of a projects lifetime. There will be exceptions, but in general this is true.

You can always migrate that data to a more useful format if you find it starts hurting you at scale, if you start with the assumption you need the scale you're hurting yourself in the here and now for theoretical future benefit.

> The real reason people use joins is because they want to pack a lot of details onto the user's screen when they are looking at a list view

This is completely, emphatically wrong. I'm somewhat miffed at the air of authority you're using here. People use joins for the normalization of data.

This perspective only makes sense of you assume that designing a scalable system requires MORE work. My experience is that designing a scalable system requires LESS work if you and your team have the right skillset.

In most cases, I can build a scalable system faster than I can build a non-scalable one with the same feature set.

It would make no sense for me to implement the lesser alternative if it requires the same or more work.

I'm always leery of people who claim to be senior and have never spent 3-5 years on the same system, and this attitude is why.

It takes at least that long to really start surfacing the design errors that were made that kills productivity long-term in a system. As a result I very often will claim the difference between a skilled and unskilled developer is the ability for a system they built to be reasonable after 5+ years without everyone involved wanting to rebuild the entire thing from scratch.

IOW, this is a fundamental difference in perspective. I was speaking to creating systems that are maintainable over the long haul by actively trying to control complexity. You're speaking speed of initial development.

Rich Hickey went on a small rant in one of his videos (I think the one describing datomic, but could be wrong) in which he pointed out many things that are fast initially will hurt long-term. I agree with that sentiment wholeheartedly.

The fact that you called the less complex alternative the "lesser" alternative speaks volumes. It honestly feels like the whole "mongodb is webscale" devbro culture rearing its ugly head.

I tend to prefer combining data at the last moment on the client side rather than having it pre-combined on the server side (I prefer REST philosophy over GraphQL). It's probably because I'm web-application focused and so scalability and concurrency is far more important to me than raw execution time. Maybe if I was a data scientist or embedded systems developer, I would care more about execution time. I've met people like that. But IMO performant scripts tend to be the result of more optimizations which makes them harder to maintain as the underlying engines or hardware changes.
This has nothing to do with raw execution time.
> I prefer to assemble data on the front end as much as possible because it allows my REST API calls to be granular

It's clear you have never work with a lot of data.

> The real reason people use joins is because they want to pack a lot of details onto the user's screen

I hate this illusion that web programming is the whole of software development.

> It's clear you have never work with a lot of data.

Sure, I only wrote an open source distributed pub/sub system with channel-based sharding which has been used by thousands of companies to support hundreds of thousands of concurrent users, but I guess 'lots of data' is a relative term.

That has nothing to do with data or data modeling.
Well that was just my hobby and side-gig... As part of my day jobs, I also worked on many projects with different databases including MySQL, Postgres, SQLite, MongoDB. I also implemented a side project (a distributed financial transaction processing system) using RethinkDB with per-table sharding and replication which runs on Kubernetes with statefulsets for persistence with automatic deployment and autoscaling and automatic database shard re-balancing with high availability with eventual consistency; I used a 2-phase commit algorithm for certain operations to achieve reliability in the event of write failure; as not to rely on atomic database transactions. I also did a course on relational database modeling at university (focused on ER diagrams and database normalization). I worked in the blockchain sector. I wrote a stateful, quantum-resistant blockchain from scratch including the cryptographic signature algorithm which uses an improved Lamport OTS variant suggested by Ralf Merkle and which uses a Merkle Signature Tree for key reuse and I contributed to the front end too. I also wrote a deterministic, fork-resistant, idempotent, heterogeneous multi-chain, chain-to-chain decentralized exchange. I also lead a team which wrote a P2P networking library with decentralized routing and efficient propagation of messages to peers belonging to the same subnets - Nodes in the network organized themselves into an unstructured, partial mesh topology with peer shuffling to avoid eclipse attacks but still retained the ability to form subnets based on the features they supported. But still, "a lot of data" a relative term.
None of that is about data, it's about distributed computing.

No one is saying you're not a smart guy with skills, just that you're obviously not familiar with working with lots of data.