by ethbro
2620 days ago
Curious about how you'd scale with data versioning. For any kind of realtime, high-bandwidth feed, I feel like what you're suggesting isn't cost-effective for the benefits it provides. If you need absolute reproducibility and back-testing, or your feed is lower bandwidth, it may make sense. But not for larger systems.
This is mainly relevant if your data is used for training.
It seems like you'd want to use a log-based system like Kafka to manage versioning and state in this case. I imagine you could:
1. Store incoming training data in a "raw data" topic.
2. A model trainer consumes the incoming training data, updates the model's state, and at a pre-determined interval writes that state, tagged with the "raw data" offset it reflects, to a "model state checkpoint" topic.
3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and upon success writes to a "latest best model" topic.
4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.
I imagine you could add constraints about model continuity or gradual release to production that would make the process more complex, but I feel like fundamentally Kafka solves a lot of the distributed systems problems.
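
To make the four steps above concrete, here's a rough sketch. It does not use a Kafka client at all: `Topic` is a hypothetical in-memory stand-in for an append-only log addressed by offset, the "model" is just a running mean, and the regression test is an arbitrary threshold. All names (`Topic`, `Checkpoint`, `train`, `promote`, `latest_model`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Topic:
    """Hypothetical in-memory stand-in for a topic: an append-only log."""
    name: str
    log: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        self.log.append(record)
        return len(self.log) - 1  # offset of the record just written

    def read_from(self, offset: int) -> List[Any]:
        return self.log[offset:]


@dataclass(frozen=True)
class Checkpoint:
    raw_offset: int  # last "raw data" offset folded into the model
    count: int       # number of records seen so far
    mean: float      # the toy "model" is just a running mean


def train(raw: Topic, checkpoints: Topic, every: int) -> None:
    """Step 2: consume raw data and checkpoint the state every `every` records."""
    count, mean = 0, 0.0
    for offset, x in enumerate(raw.read_from(0)):
        count += 1
        mean += (x - mean) / count  # incremental mean update
        if count % every == 0:
            checkpoints.append(Checkpoint(offset, count, mean))


def promote(checkpoints: Topic, best: Topic,
            passes: Callable[[Checkpoint], bool]) -> None:
    """Step 3: regression-test each checkpoint and publish the ones that pass."""
    for cp in checkpoints.read_from(0):
        if passes(cp):
            best.append(cp)


def latest_model(best: Topic) -> Optional[Checkpoint]:
    """Step 4: a production worker picks up the newest promoted model."""
    return best.log[-1] if best.log else None


raw = Topic("raw data")
checkpoints = Topic("model state checkpoint")
best = Topic("latest best model")

# Step 1: store incoming training data (100.0 plays the role of bad data).
for x in [1.0, 2.0, 3.0, 4.0, 100.0, 5.0]:
    raw.append(x)

train(raw, checkpoints, every=2)
promote(checkpoints, best, passes=lambda cp: cp.mean < 10.0)
model = latest_model(best)
print(model)
```

Since each checkpoint carries the "raw data" offset it was built from, you can reproduce any model by replaying the raw log up to that offset, which is the versioning property I'm after. In this toy run the third checkpoint is skewed by the outlier and fails the test, so production stays on the checkpoint taken at offset 3.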