Hacker News new | ask | show | jobs
by ethbro 2620 days ago
Curious about how you'd scale with data versioning.

In any type of realtime, high bandwidth feed, I feel like what you're suggesting isn't cost effective for the benefits it provides.

If you need absolute reproducibility and back-testing or your feed is lower bandwidth, it maybe makes sense. But not for larger systems.

1 comments

Interesting topic. :)

This is mainly relevant if your data is used for training.

It seems like you'd want to use a log-based system like kafka to manage versioning and state in this case. I imagine you could:

1. Store incoming training data in a "raw data" topic.

2. A model trainer consumes incoming training data, updates a model's state, and at a pre-determined period writes the model's state as of a given offset in the "raw data" topic in a "model state checkpoint" topic.

3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and upon success writes to a "latest best model" topic.

4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.

I imagine you could add constraints about "model" continuity or gradual release to production that would make the process more complex, but I feel like fundamentally kafka solves a lot of the distributed systems problems.