Hacker News new | ask | show | jobs
by geebee 2102 days ago
Nice link. Nothing controversial, but sometimes simplicity is controversial in our field.

I've slowly come around to seeing proper database design as the most essential foundation of an IT system. I remember reading "your data will outlast your application", and I've been around as a developer long enough to have lived it.

One big anti-pattern I've seen with ORMs is that developers who don't really think in terms of data and relationships use the ORM as a kind of object serialization usable only from the application. Rather than thinking of the database as something useful that could be queried and accessed outside the context of the application, they write objects out to various tables and then re-import and re-construct them once they're needed again in the app, often with dependencies that are in code or even yaml configuration files. The upshot is that you simply can't use the database as anything other than a persistence tier for an app. It really would be no different if they had simply given the objects and id and pickled them to disk. The resulting data store really is that inaccessible and meaningless outside the context of the app.

As a result, if an analyst wants a report, they can't write SQL to get those reports, even though the persistence tier is, if perhaps in name only, a "relational database". And when the app goes away (as it inevitably does), they'll still want to know, say, how many beakers and test tubes were ordered by a lab tech who participated in 100 or more experiments per year with at least one faculty member from radiology between June and December in 2007.

But because they don't really know SQL, they see no value to it, and they're honestly just kinda irritated that they don't have an object database, which is what they understand a database to be - a way to pickle and reimport objects. The application outlasts the developer, and the data outlasts the application.

Yeah, if you're writing an app that will hold potentially useful data, definitely think about how this will be accessed outside the context of the app that perhaps inspired its creation, and think about how you'd get at the data if the application went away. That'll probably lead you to old, good database design.

Now, I actually do think that you can create a decent relational database through an ORM. I saw this go sideways with Rails a bunch of times, but I think that's probably because Rails made it so easy to start developing that a lot of people new to application development skipped the design stage of the data backend. You can construct a pretty robust database with migrations rather than CREATE TABLE statements, as long as you stick to the basic principles - but the technology makes it relatively easy to do the opposite and never really get into the mindset of data.

2 comments

> Nothing controversial

ahem

> Foreign Key constraint is the king of the relational database design

Amazon does not use FK constraints and I have rarely run into systems that do (since 1996ish). Most people with big enough datasets learn not to use them. The overhead for orphaned data is far less than the consequences of using them.

Ah. Fair enough, yes, I agree. That statement goes a little too far, and is a bit at odds with the otherwise overall light tone of the article and the YMMV-ish disclaimer at the end.
Can you clarify about the overhead you're speaking of?

I assume it only comes into play at super massive scale like Amazon-level datasets.

Not even super massive like amazon, even an app for a few million users you’ll run into the performance problems of FKs. They are really overrated in their usefulness - as parent says, orphan records are really not a big deal. As soon as you get to any moderate scale, dealing with a small level of data inconsistencies is inevitable. Especially as you grow into a multiple services, multiple databases type architecture where you fundamentally have to handle breaking fks btw systems
When not using FK constraints, data may be modified or lost (deleted) without proper cleanup of related records leading to "orphaned" records without relationships (or related ids that do not correspond to existing records). The disk space for these orphaned records and the time to run a relational integrity checks across tables (removing orphans), is of minor concern and effort, comparative to a troublesome database entry with FK constraints.
> One big anti-pattern I've seen with ORMs is that developers who don't really think in terms of data and relationships use the ORM as a kind of object serialization usable only from the application. Rather than thinking of the database as something useful that could be queried and accessed outside the context of the application, they write objects out to various tables and then re-import and re-construct them once they're needed again in the app, often with dependencies that are in code or even yaml configuration files. The upshot is that you simply can't use the database as anything other than a persistence tier for an app. It really would be no different if they had simply given the objects and id and pickled them to disk. The resulting data store really is that inaccessible and meaningless outside the context of the app.

> As a result, if an analyst wants a report, they can't write SQL to get those reports, even though the persistence tier is, if perhaps in name only, a "relational database". And when the app goes away (as it inevitably does), they'll still want to know, say, how many beakers and test tubes were ordered by a lab tech who participated in 100 or more experiments per year with at least one faculty member from radiology between June and December in 2007.

> But because they don't really know SQL, they see no value to it, and they're honestly just kinda irritated that they don't have an object database, which is what they understand a database to be - a way to pickle and reimport objects. The application outlasts the developer, and the data outlasts the application.

I'm on the developer's side here. Databases are too big and complex to be used by more than one application; if you have a database then it's really important to have a single owner for that database, or you'll never be able to evolve the schema. If you let an analyst write their own queries against your database, you have no control over what queries are running from where or for whom, so you can't so much as rename a column. And sooner or later your production system will lock up because an analyst wrote a query that they didn't realise had too many joins in.

If it's important to have an archival record of your data, put that in your design constraints and build it into the system. Even then, I probably wouldn't pick an SQL database as the system for doing it with.

Even if you have an ETL pipeline to an OLAP database/data warehouse/etc, if your core database design is hostile to the analytics/etc then it's going to be a pain no matter how carefully they use it.

> it's really important to have a single owner for that database, or you'll never be able to evolve the schema...

IMO, the "owning" application/developers reserve the right to evolve the schema-and if that temporarily breaks ETL, then so be it, but the underlying schema itself shouldn't be hostile to analytics/etc.

> Even if you have an ETL pipeline to an OLAP database/data warehouse/etc, if your core database design is hostile to the analytics/etc then it's going to be a pain no matter how carefully they use it.

Disagree. You don't need a single "core database design". It's fine to have different representations of your data for different purposes, and a transformation pipeline between them; that's the whole idea of CQRS etc.

Yes I'm not disagreeing there, I'm all for pipelines and CQRS and dedicated databases for dedicated purposes. The point I'm making is that if the original schema is a pain to work with, you can have as many pipelines and databases as you want, getting the actual data you want isn't any less of a pain.
> if the original schema is a pain to work with, you can have as many pipelines and databases as you want, getting the actual data you want isn't any less of a pain.

I don't think that's really true. If the original schema is just something you're ingesting before transforming then it doesn't really matter how bad it is; all you're gonna be doing is scanning over all the tables one way or another.