Hacker News new | ask | show | jobs
by SirOibaf 2250 days ago
https://logicalclocks.com with their ML + Feature Store open source platform Hopsworks and their managed cloud version https://hopsworks.ai
1 comments

Disclaimer: i am a co-founder of Logical Clocks. There are loads of interesting technical challenges in this "Feature Store" space. Here are just a few we address in Hopsworks:

1. To replicate models (needed for regulatory reasons), you need to commit both data and code. If you have only a few models, fine just archive the training data. But, if you have lots of models (dev+prod) and lots of data - you can't use git-based approaches where you commit metadata and make immutable copies of data. It scales (your data!) badly. We are following the ACID datalake approach (Apache Hudi), where you store diffs of your data and can issue queries like "Give me training data for these features as it was on this date".

2. You want one feature pipeline to compute features (not one for training and a different one when serving features). Your feature store should scale to store TBs/PBs of cached features to generate train/test data, but should also return feature vectors in single ms latency for online apps to make predictions. What DB has those characteristics? We say none, and we adopt a dual-DB approach with one DB for low-latency and one for scale-out SQL. We use open-source NDB and Hive on our HopsFS filesystem - where all 2 DBs and the filesystem share the same unified, scale-out metadata layer (a "rm -rf feature_group" on the filesystem also automatically cleans up Hive and feature metadata)

3. You want to be able to catalog/search for features using free-text search and have good exploratory data analysis. The systems challenge here is how to allow search on your production DB with your features. Our solution is that we provide a CDC API to our Feature Store, and automatically sync extended metadata to Elastic with an eventually consistent replication protocol. So when you 'rm -rf ..' on your filesystem, even the extended metadata in Elastic is automatically cleaned up.

4. You need to support reuse of features in different training datasets. Otherwise, what's the point? We do that using Spark as a compute engine to join features from tables containing normalized features.

References:

* https://www.logicalclocks.com/blog/mlops-with-a-feature-stor... * https://ieeexplore.ieee.org/document/8752956 (CDC HopsFS to Elastic) * http://kth.diva-portal.org/smash/get/diva2:1149002/FULLTEXT0... (Hive on HopsFS)