Interesting topic. :)

This is mainly relevant if your data is used for training.

It seems like you'd want a log-based system like Kafka to manage versioning and state in this case. I imagine you could:

1. Store incoming training data in a "raw data" topic.

2. A model trainer consumes the "raw data" topic, updates the model's state, and at a predetermined interval writes a checkpoint to a "model state checkpoint" topic, tagged with the "raw data" offset that the state reflects.

3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and, when a checkpoint passes, writes it to a "latest best model" topic.

4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.
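
To make the four steps concrete, here's a rough sketch in Python. It does not talk to Kafka at all: `Topic` is a hypothetical in-memory stand-in for an append-only log addressed by offset, the "model" is just a running mean, and the names `Checkpoint`, `train`, `regression_test`, `Worker`, `every`, and `max_err` are all made up for illustration. The point is only to show how checkpoints tied to raw-data offsets flow between topics.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


class Topic:
    """Hypothetical in-memory stand-in for a Kafka topic: an append-only log."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._log: List[Any] = []

    def append(self, record: Any) -> int:
        self._log.append(record)
        return len(self._log) - 1  # offset of the record just written

    def read_from(self, offset: int) -> List[Any]:
        return self._log[offset:]

    def latest(self) -> Optional[Any]:
        return self._log[-1] if self._log else None


@dataclass
class Checkpoint:
    raw_offset: int  # last "raw data" offset folded into this state
    count: int
    total: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def train(raw: Topic, checkpoints: Topic, every: int) -> None:
    """Step 2: consume raw data, update state, checkpoint every `every` records."""
    count, total = 0, 0.0
    for offset, value in enumerate(raw.read_from(0)):
        count += 1
        total += value
        if count % every == 0:
            checkpoints.append(Checkpoint(offset, count, total))


def regression_test(checkpoints: Topic, best: Topic,
                    holdout: List[float], max_err: float) -> None:
    """Step 3: promote checkpoints whose mean absolute error on a holdout set passes."""
    for cp in checkpoints.read_from(0):
        err = sum(abs(x - cp.mean) for x in holdout) / len(holdout)
        if err <= max_err:
            best.append(cp)


class Worker:
    """Step 4: a production worker that swaps in the newest approved model."""

    def __init__(self) -> None:
        self.model: Optional[Checkpoint] = None

    def poll(self, best: Topic) -> bool:
        latest = best.latest()
        if latest is not None and latest != self.model:
            self.model = latest
            return True  # state changed
        return False


# Step 1: incoming training data lands in the "raw data" topic.
raw = Topic("raw data")
for v in [10.0, 2.0, 3.0, 1.0, 2.0, 3.0]:
    raw.append(v)

checkpoints = Topic("model state checkpoint")
best = Topic("latest best model")

train(raw, checkpoints, every=2)
regression_test(checkpoints, best, holdout=[2.0, 3.0], max_err=1.5)

worker = Worker()
worker.poll(best)
print(worker.model.raw_offset, worker.model.mean)  # prints "5 3.5"
```

Because each checkpoint carries the raw-data offset it was built from, a trainer that crashes could in principle resume from the last checkpoint and replay only the records after that offset. With real Kafka you'd replace `Topic` with producers and consumers, and the "latest best model" topic would be a natural candidate for log compaction, but that part is left out here.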

I imagine you could add constraints around model continuity or gradual release to production that would make the process more complex, but I feel like Kafka fundamentally solves a lot of the distributed systems problems here, since it already gives you an ordered, durable, replayable log.