| Hey HN. I'm facing this challenge at my job where I receive user's data from N sources and need to create a single source of truth (a table with single entities for each user). I'm looking for best practices, I don't even know how to start it. When I look for that on the web I find many random "tutorials" where it ends up trying to sell you a tool for that purpose.
If anyone could point me in a good direction/book/tutorial on that matter (Golden Record, SSOT, MDM, etc.) I would really appreciate it. Ps: I use GCP (BigQuery SQL) and Python. Thanks! |
What type of user data is it? It sounds like you have multiple "input" sources of truth, and are wanting to create a single reporting dataset that has the latest info from each?
Once you can break down the datasets into facts and dimensions, it becomes more clear which update strategies to use for each. Master data like names and email addresses usually goes into a dimension table, and transactional data like purchases, page views into a fact table.
How do you know when each record has changed in each source? Are there edge cases where changes can't be detected? How much latency is acceptable in the reporting system? 1 day? 5 minutes? Those questions are important to explore early on.
You're getting into data engineering. It's a great field to learn and the job market is hot. But if you are wanting to avoid data wrangling completely, check out kloud.io and fivetran. They do a lot of the heavy lifting. If they support most of your systems it could save you and your company a lot of time. (I'm not affiliated but know both products well from firsthand experience)