| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by roenxi 1131 days ago

In the case of this analyst, say every couple of days he is writing SELECT * FROM events JOIN projects. Everyone is asking him for information about events and projects.

He's going to save time and make less mistakes if he uses DBT and has a denormalised tables that merges events and projects. However you can think of to define costs and benefits, it'll turn out to be a good choice. As he discovered in the article. Good move. People keep asking him to do it, there is probably value there. He is avoiding a cost which is writing the same join over and over again.

Normalisation is there to make data accessible for multiple different users (analysts, application programmers, infrastructure teams, people who want to leverage the database for a new purpose, etc). It isn't good at servicing any specific need, but it is a basic and general data layout that lets people tailor the data to their needs quickly. When there is a specific user, they should always be asking if the normalised data layout is helpful and looking for opportunities to avoid writing the same JOIN in 20 different queries. As long as the source of truth is in normal form it is reasonable practice to denormalise for specific use cases.

DBT does this really well I might add - it encourages normalising the source of truth and then denormalising the data by an analytics team to meet business needs. The ideas there are strong, flexible and encourage good practices. Analysts love big flat tables, they are easy to work with.

4 comments

goto11 1130 days ago

> Normalisation is there to make data accessible for multiple different users

Normalization is there to avoid anomalies which is another word for data corruption. If you have the same datum repeated multiple times in the base tables, any update can (and probably will, due to Murphys law) lead to inconsistencies, which mean the database is not a reliable source of information anymore. How do you quantify the cost of that?

> looking for opportunities to avoid writing the same JOIN in 20 different queries.

Then you define a view, with is literally just a named and reusable query, which can be used in other queries. Writing queries or using views is certainly not "denormalization". Having redundant data in a query or view output is commonplace and not a problem since it cannot introduce update-anomalies. (Some databases allow updateable views in some scenarios, but only when the view can be unambiguously mapped back to base tables, so no risk of update-anomalies here either.)

link

datathrow0007 1130 days ago

Use a view that acts as a quick-and-useful abstraction to mimic a denormalized table?

E.g.

``` CREATE VIEW vw_events_and_projects AS SELECT * FROM events JOIN projects ```

Then

``` SELECT * FROM vw_events_and_projects ```

Edit:

And if you need OLAP, replicate the normalized table to a database that handles analytics workflows better (e.g. ClickHouse).

Then you get the normalized forms for your OLTP workflows (your "bread and butter"); and you also get the efficiency and ergonomics of real-deal OLAP.

Of course, your biggest issue is going to be keeping the two in-sync. Obvious case is to have your OLTP database stream synchronization data to the replica whenever data is modified.

link

roenxi 1130 days ago

You'd like DBT, you should go and read up on it. It uses views.

link

cm2187 1131 days ago

Plus there are many cases where you want to see the data as of a given time, for instance all the positions of a fund at a given time. Then it makes sense to denormalise because this data should never be updated in the future.

link

im3w1l 1131 days ago

What about using logs as the source of truth? It's a pattern I've seen multiple times.

link