|
|
|
|
|
by roenxi
1131 days ago
|
|
In the case of this analyst, say every couple of days he is writing SELECT * FROM events JOIN projects. Everyone is asking him for information about events and projects. He's going to save time and make less mistakes if he uses DBT and has a denormalised tables that merges events and projects. However you can think of to define costs and benefits, it'll turn out to be a good choice. As he discovered in the article. Good move. People keep asking him to do it, there is probably value there. He is avoiding a cost which is writing the same join over and over again. Normalisation is there to make data accessible for multiple different users (analysts, application programmers, infrastructure teams, people who want to leverage the database for a new purpose, etc). It isn't good at servicing any specific need, but it is a basic and general data layout that lets people tailor the data to their needs quickly. When there is a specific user, they should always be asking if the normalised data layout is helpful and looking for opportunities to avoid writing the same JOIN in 20 different queries. As long as the source of truth is in normal form it is reasonable practice to denormalise for specific use cases. DBT does this really well I might add - it encourages normalising the source of truth and then denormalising the data by an analytics team to meet business needs. The ideas there are strong, flexible and encourage good practices. Analysts love big flat tables, they are easy to work with. |
|
Normalization is there to avoid anomalies which is another word for data corruption. If you have the same datum repeated multiple times in the base tables, any update can (and probably will, due to Murphys law) lead to inconsistencies, which mean the database is not a reliable source of information anymore. How do you quantify the cost of that?
> looking for opportunities to avoid writing the same JOIN in 20 different queries.
Then you define a view, with is literally just a named and reusable query, which can be used in other queries. Writing queries or using views is certainly not "denormalization". Having redundant data in a query or view output is commonplace and not a problem since it cannot introduce update-anomalies. (Some databases allow updateable views in some scenarios, but only when the view can be unambiguously mapped back to base tables, so no risk of update-anomalies here either.)