by jamesblonde
2208 days ago
All your points are valid. However, operational models (models used by online applications, for example) typically need access to lots of historical features that are not available in the application itself. In that case, you need to go to a low-latency database/store to get your feature values (to build your feature vectors). If you want to reuse those features across different models, you will need join support for building the feature vectors, so a key-value DB won't help there. Now your features are duplicated between this online/serving layer and the data warehouse. How do you sync them up?
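To make the join point concrete, here is a minimal sketch, not how any particular feature store does it. The two feature groups, the `customer_id` key, and `build_feature_vector` are all hypothetical names, and plain dicts stand in for tables in an online store. The idea is that each model picks its own subset of shared features, so the lookup has to join several groups on the entity key:

```python
# Hypothetical feature groups keyed by customer_id. In a real system these
# would be tables in a low-latency online store, not in-memory dicts.
ACCOUNT_FEATURES = {
    42: {"account_age_days": 730, "country": "SE"},
}
ACTIVITY_FEATURES = {
    42: {"txn_count_30d": 17, "avg_txn_amount_30d": 54.2},
}


def build_feature_vector(entity_id, feature_groups, feature_names):
    """Join one entity's rows across feature groups and return the values
    in the order the model expects."""
    joined = {}
    for group in feature_groups:
        row = group.get(entity_id)
        if row is None:
            raise KeyError(f"no features for entity {entity_id}")
        joined.update(row)  # join on the shared entity key
    return [joined[name] for name in feature_names]


# Two models reuse the same groups but select different features.
fraud_vector = build_feature_vector(
    42, [ACCOUNT_FEATURES, ACTIVITY_FEATURES],
    ["account_age_days", "txn_count_30d"],
)
churn_vector = build_feature_vector(
    42, [ACCOUNT_FEATURES, ACTIVITY_FEATURES],
    ["country", "avg_txn_amount_30d"],
)
```

With a pure key-value layout you would end up storing one precomputed vector per model, which is exactly the duplication that makes reuse painful.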
The other thing you're missing is time-travel queries (temporal logic for SQL, in data warehouse speak). Yes, Delta Lake gives you this, but you will need to wrap that data in APIs so that your data scientists can use it. For data drift, a library alone won't cut it. You need to compare descriptive statistics/distributions of the data used to train the model with those of the live data coming in. Where do you get those statistics from? The feature store, in our case (with the help of versioning+metadata). Then there is end-to-end governance of ML models: what training dataset was used to train this model, and can I reproduce that training dataset if it hasn't been archived? You need metadata to manage all that. So yes, you can do it, but you have to build something (as the article describes) or buy it.
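As a rough illustration of the drift point, here is a sketch under simple assumptions: the training-time statistics are computed once and stored as metadata, and live values are compared against them later. The function names, the sample numbers, and the threshold of 3 standard errors are all hypothetical choices, and a real check would likely compare full distributions rather than just means:

```python
import statistics


def describe(values):
    """Descriptive statistics to store alongside a training dataset version."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def drifted(train_stats, live_values, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` standard
    errors away from the stored training mean (an assumed, simple rule)."""
    live_mean = statistics.fmean(live_values)
    stderr = train_stats["stdev"] / len(live_values) ** 0.5
    if stderr == 0:
        return live_mean != train_stats["mean"]
    return abs(live_mean - train_stats["mean"]) / stderr > threshold


# Statistics saved at training time (made-up sample values).
train_stats = describe([10, 12, 11, 9, 10, 11, 12, 10])

similar = drifted(train_stats, [10, 11, 10, 12])
shifted = drifted(train_stats, [20, 21, 19, 22])
```

The library part is the easy bit. The hard part is having `train_stats` available for the exact dataset version the model was trained on, which is where the versioning and metadata come in.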