Hacker News new | ask | show | jobs
by roenxi 1131 days ago
Well...

1) Normalisation at all costs is foolish - if the cost exceeds the value, then don't do it. That isn't complicated. Denormalised data sometimes points at design flaws, but even then all systems have design flaws and they don't automatically need to be fixed. Quality is expensive, like every other property (even doing things the cheap way is expensive, ironically - software is all about managing costs).

2) For any given user it is better to have denormalised data where the data model is perfectly aligned to their use case. For a system with multiple users it is better to have normalised data. And the corollary is that any data important enough to be recorded is probably valuable enough that it will eventually have multiple interested users even if the person building the system swears that this time is different - so they should normalise their data. Brownie point to anyone who has reached enlightenment and understands the you of 12 months hence is a different user with different needs of the data.

6 comments

Regarding quality being expensive.

It's not only about cost to implement. There's also cost to change.

If your isolated module is bad, you can rewrite the code, keeping API the same. Cost to change is not high. You might introduce new bugs and that's about it.

Changing database structure might be hard. Adding new checks might require manual fixes to already bad data or multiple code paths for old and new data. Some migrations might require putting system offline. Often you can't just rollback your changes if things went wrong after few days.

Changing API with hundreds of customer... Good luck with that.

Changing POSIX API at this moment probably just not possible.

Whenever something is hard to change, quality requirements are naturally higher.

For data model quality requirements should be high. More time you spend, more time will be saved later. As we say: we're not so rich to buy cheap things. We're not so rich to afford poor DB schemas.

As usual, context matters and decontextualized discussions often devolve into people shouting past each other.

Sometimes you can't afford to do it right, quick and dirty is the way. Sometimes you can't afford not to do it right. It all depends heavily on how costs and payoffs are distributed socially and temporally.

The real trick is to be doing what fits your situation at the moment, and knowing how your situation might change over time.

> For any given user it is better to have denormalised data where the data model is perfectly aligned to their use case. For a system with multiple users it is better to have normalized data.

I don't see why you are making the distinction by single user or multi user. If you are writing information into the database, it has to be written into a normalized schema. There is no alternative, multi user or not. To do otherwise would be to maintain multiple sources of truth and nothing in the database would make sense if two different sources of data go out of sync. Make illegal states unrepresentable.

Also if you want to align your data model perfectly to a user you might as well use a single-table database like DynamoDB [1] where everything is screaming fast but the downside is you can't modify your schema at all (because it's already set in stone to perfectly model the old use case).

[1] https://news.ycombinator.com/item?id=18824164

> if the cost exceeds the value, then don't do it. That isn't complicated.

Could you try to elaborate on what you mean by "cost" and "value"? The complicated part is not the part you said, but specifying those two very abstract terms.

In the case of this analyst, say every couple of days he is writing SELECT * FROM events JOIN projects. Everyone is asking him for information about events and projects.

He's going to save time and make less mistakes if he uses DBT and has a denormalised tables that merges events and projects. However you can think of to define costs and benefits, it'll turn out to be a good choice. As he discovered in the article. Good move. People keep asking him to do it, there is probably value there. He is avoiding a cost which is writing the same join over and over again.

Normalisation is there to make data accessible for multiple different users (analysts, application programmers, infrastructure teams, people who want to leverage the database for a new purpose, etc). It isn't good at servicing any specific need, but it is a basic and general data layout that lets people tailor the data to their needs quickly. When there is a specific user, they should always be asking if the normalised data layout is helpful and looking for opportunities to avoid writing the same JOIN in 20 different queries. As long as the source of truth is in normal form it is reasonable practice to denormalise for specific use cases.

DBT does this really well I might add - it encourages normalising the source of truth and then denormalising the data by an analytics team to meet business needs. The ideas there are strong, flexible and encourage good practices. Analysts love big flat tables, they are easy to work with.

> Normalisation is there to make data accessible for multiple different users

Normalization is there to avoid anomalies which is another word for data corruption. If you have the same datum repeated multiple times in the base tables, any update can (and probably will, due to Murphys law) lead to inconsistencies, which mean the database is not a reliable source of information anymore. How do you quantify the cost of that?

> looking for opportunities to avoid writing the same JOIN in 20 different queries.

Then you define a view, with is literally just a named and reusable query, which can be used in other queries. Writing queries or using views is certainly not "denormalization". Having redundant data in a query or view output is commonplace and not a problem since it cannot introduce update-anomalies. (Some databases allow updateable views in some scenarios, but only when the view can be unambiguously mapped back to base tables, so no risk of update-anomalies here either.)

Use a view that acts as a quick-and-useful abstraction to mimic a denormalized table?

E.g.

``` CREATE VIEW vw_events_and_projects AS SELECT * FROM events JOIN projects ```

Then

``` SELECT * FROM vw_events_and_projects ```

Edit:

And if you need OLAP, replicate the normalized table to a database that handles analytics workflows better (e.g. ClickHouse).

Then you get the normalized forms for your OLTP workflows (your "bread and butter"); and you also get the efficiency and ergonomics of real-deal OLAP.

Of course, your biggest issue is going to be keeping the two in-sync. Obvious case is to have your OLTP database stream synchronization data to the replica whenever data is modified.

You'd like DBT, you should go and read up on it. It uses views.
Plus there are many cases where you want to see the data as of a given time, for instance all the positions of a fund at a given time. Then it makes sense to denormalise because this data should never be updated in the future.
What about using logs as the source of truth? It's a pattern I've seen multiple times.
I find it interesting how the argument of the current top comment[1] - that denormalized tables are great, as long as you never have any updates for existing entities - basically follows directly from your point 2) :

If for any individual use case a denormalized table fit specifically to that usecase is superior to a normalized table, the most straightforward solution for multiple use cases would be to just keep multiple tables and replicate any insert on each of them. This would work relatively well until the moment where you have to update something...

[1] https://news.ycombinator.com/item?id=35927226

> Normalisation at all costs is foolish - if the cost exceeds the value, then don't do it.

Value of denormalization: May improve query performance for some particular use cases. (But probably not any more than using a materialized view)

Cost of denormalization: Risk of anomalies (inconsistencies) which mean the database is not a reliable source of information anymore. Murphys law mean that this is basically bound to happen sooner or later. You should worry about this.

Just use a view.

I kinda observe this from a different point of view: what are the use cases needed for the data and what are the queries that could satisfy those?

The dynamodb book was enlightening from this point of view even for designing sql databases where we normalize data as much as possible no matter what.

The book is this one: https://www.dynamodbbook.com/
Hi there, when you say the dynamodb book, do you mean https://www.dynamodbbook.com/?
Yes, I mean that one!
Which DynamoDB book? It's very relevant to my interests.