by sreekanthr
2615 days ago
I cannot answer your question without a full understanding of how your data infrastructure is currently used. A few pointers:

- Who are the users of the platform? If it is only used by the data science team, then you can rip the solution apart and work towards a more logical infrastructure where all you do is cleanse, normalize, and derive features. These become your central feature repository, which your team can pull from to build models. You need governance so that the team is aligned on which features exist and how new features get added to the repository. At a scale of 10 people it is much easier to keep this all centralized; if the team is scaling out, then we will have to work out a decentralization strategy.
- If you have operational reports, like business reporting and investor reporting, running on this infra, then I would recommend keeping the analytics workload separate from the operational workload. They have different needs and SLAs.

One thing that worries me is that you are talking about denormalization as something you are planning to do; that should have been the starting point of any HDFS/Spark/Parquet-based solution.

I can suggest tools for exploration, data quality checks, etc., but that requires a better understanding of what your current infrastructure is solving versus what it was intended to solve.
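To make the central feature repository idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Feature` and `FeatureRegistry` names, the `revenue` and `n_transactions` columns, and the `avg_basket` feature are hypothetical, and a real setup on Spark/Parquet would look different. The point is only the governance part: one place where a feature is defined once, has an owner and a description, and cannot be silently redefined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]  # one cleansed, normalized input record (hypothetical shape)


@dataclass(frozen=True)
class Feature:
    name: str                        # unique name the whole team agrees on
    owner: str                       # who to ask when the definition changes
    description: str                 # required, so features stay discoverable
    compute: Callable[[Row], float]  # derives the feature from one input row


class FeatureRegistry:
    """Single shared place where feature definitions live."""

    def __init__(self) -> None:
        self._features: Dict[str, Feature] = {}

    def register(self, feature: Feature) -> None:
        # Governance rule 1: a name can only be defined once.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        # Governance rule 2: every feature must be documented.
        if not feature.description:
            raise ValueError("a description is required")
        self._features[feature.name] = feature

    def names(self) -> List[str]:
        return sorted(self._features)

    def build(self, rows: List[Row], names: List[str]) -> List[Row]:
        # Compute the requested features for every input row.
        return [{n: self._features[n].compute(r) for n in names} for r in rows]


# Hypothetical example feature, registered once and reused by every project.
registry = FeatureRegistry()
registry.register(
    Feature(
        name="avg_basket",
        owner="team-a",
        description="Revenue per transaction",
        compute=lambda r: r["revenue"] / r["n_transactions"],
    )
)
```

Each project would then call `registry.build(rows, ["avg_basket"])` instead of re-implementing the calculation in its own repository.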
> If it is only used by data science team then you can rip apart the solution and work towards a more logical infrastructure where all you are doing is cleansing, normalizing and deriving features and these become your central feature repository which your team can pull and build models.
It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.
> you are talking about denormalization as something you are planning to do, that should have been the starting point of any HDFS/SPARK/Parquet based solution.
It is something that we have to do, but the tables have been dumped as-is in S3 and every project rebuilds the whole derived dataset regularly. Since these operations are very brittle (a lot of manual work, and even transformations performed in notebooks), this is something people dread doing. I am trying to secure this at the moment by writing Makefiles that remove human intervention, but at the end of the day I would like people not to spend hours waiting for new data when they need it.
> I can suggest tools for explorations, data quality check etc.
I would appreciate it. Put simply, we get data about the evolution of the stock of clients, transactions with their clients, product descriptions, etc. that is dumped into S3 (I scheduled a chat with people upstream to see what happens). We have 3 or 4 projects for each client. What currently happens is that every team writes the same code to build features in its own repository, and this code is re-executed every time new data arrives (weekly). These features are then used in prediction models.
Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.
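To illustrate the kind of thing I am after, here is a rough Python sketch of building features only for the weeks that have not been processed yet, instead of rebuilding everything on every weekly delivery. It assumes the data can be split into one partition per week, which I still have to confirm with the people upstream; `build_incrementally`, `store`, and `build_week` are hypothetical names, and the in-memory dict stands in for whatever storage would actually hold the derived features.

```python
from typing import Callable, Dict, Iterable, List


def build_incrementally(
    available_weeks: Iterable[str],
    store: Dict[str, dict],
    build_week: Callable[[str], dict],
) -> List[str]:
    """Build features only for weeks missing from `store`.

    Returns the list of weeks that were built during this call.
    """
    built_now: List[str] = []
    for week in sorted(available_weeks):
        if week in store:
            continue  # already derived on a previous run, skip the expensive step
        store[week] = build_week(week)
        built_now.append(week)
    return built_now
```

With something like this, the weekly run would only pay for the new partition, and people exploring the data would read already-built features rather than wait for a full rebuild.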