Hacker News new | ask | show | jobs
by stadium 1612 days ago
I'd start with learning about dimensional modeling, with one of the newer Kimble data warehouse toolkit books. And also document the data flow between all of these systems to make it easier to talk through options with stakeholders.

What type of user data is it? It sounds like you have multiple "input" sources of truth, and are wanting to create a single reporting dataset that has the latest info from each?

Once you can break down the datasets into facts and dimensions, it becomes more clear which update strategies to use for each. Master data like names and email addresses usually goes into a dimension table, and transactional data like purchases, page views into a fact table.

How do you know when each record has changed in each source? Are there edge cases where changes can't be detected? How much latency is acceptable in the reporting system? 1 day? 5 minutes? Those questions are important to explore early on.

You're getting into data engineering. It's a great field to learn and the job market is hot. But if you are wanting to avoid data wrangling completely, check out kloud.io and fivetran. They do a lot of the heavy lifting. If they support most of your systems it could save you and your company a lot of time. (I'm not affiliated but know both products well from firsthand experience)

1 comments

thank you! We will use the data for customer segmentation purposes. For that, we will need to transform all the user's sources into a single entity for a given user, for example. And link that entity/user to another table, with their purchases so we can do the customer segmentation. I'm not sure about latency but honestly that's not really important at the moment, I just need to have a strategy to make an end-to-end solution for gathering the data, transform it and delivery a concise and coherent "user table" for the machine learning dude.
You're probably looking at a users dimension table. There are different "types" of update strategies for dimension tables. I'd start by figuring out which type meets your stakeholders' needs. Some keep the history, others only keep the current state.

Usually I'd recommend to bring the raw data into your database first before transforming it. It's hard/impossible to predict future needs and this buys you flexibility. "ELT" describes this approach (vs ETL)

Good luck!