by maksimum 2620 days ago
> Containers are meant to be stateless infrastructure. By downloading something at startup, you're breaking that contract implicitly.

I feel that mounting an NFS partition breaks that contract in a similar way, i.e. the same image could behave differently depending on what's in the NFS partition. I feel like to get data in a "reproducible" way you need to pull it from a data versioning system. I think there are different ways to implement data versioning, each with its own trade-offs. NFS and S3, among others, could be used to implement data versioning.
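To make "pull it from a data versioning system" concrete, here is a minimal sketch, assuming versions are identified by the SHA-256 digest of the data. `VersionedStore` is a hypothetical in-memory stand-in for whatever actually holds the bytes (NFS, S3, or something else); it is not any real library's API.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the version identifier."""
    return hashlib.sha256(data).hexdigest()


class VersionedStore:
    """Hypothetical in-memory stand-in for a store backed by NFS or S3."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The version is derived from the content, so identical data
        # always maps to the same version.
        version = digest(data)
        self._blobs[version] = data
        return version

    def get(self, version: str) -> bytes:
        data = self._blobs[version]
        # Verify on read so a changed blob is detected, not silently used.
        if digest(data) != version:
            raise ValueError("stored data does not match the pinned version")
        return data


store = VersionedStore()
v1 = store.put(b"training set, first snapshot")
v2 = store.put(b"training set, second snapshot")

# A container would ship with one pinned version rather than "whatever
# is currently in the mounted partition".
PINNED_VERSION = v1
data = store.get(PINNED_VERSION)
```

With the version pinned next to the image, the same image reads the same bytes no matter what else has been written to the store since.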

I agree with you that, in theory, NFS is more performant because it lets you load data lazily.

1 comment

Curious about how you'd scale with data versioning.

For any kind of real-time, high-bandwidth feed, I feel like what you're suggesting isn't cost-effective for the benefits it provides.

If you need absolute reproducibility and back-testing, or your feed is lower bandwidth, it may make sense. But not for larger systems.

Interesting topic. :)

This is mainly relevant if your data is used for training.

It seems like you'd want to use a log-based system like Kafka to manage versioning and state in this case. I imagine you could:

1. Store incoming training data in a "raw data" topic.

2. A model trainer consumes incoming training data, updates a model's state, and at a predetermined interval writes the model's state, tagged with the offset it has reached in the "raw data" topic, to a "model state checkpoint" topic.

3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and upon success writes to a "latest best model" topic.

4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.
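The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a Kafka client: `Topic` is a hypothetical in-memory append-only log standing in for a Kafka topic, the "model" is just a running mean, and `CHECKPOINT_EVERY` and the pass range in `regression_test` are made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical append-only log; a record's offset is its list index."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset: int):
        return list(enumerate(self.records))[offset:]

    def latest(self):
        return self.records[-1] if self.records else None


raw_data = Topic()      # step 1: "raw data" topic
checkpoints = Topic()   # step 2: "model state checkpoint" topic
best_model = Topic()    # step 3: "latest best model" topic

CHECKPOINT_EVERY = 3    # assumed checkpoint interval, in records


def train(raw: Topic, out: Topic, every: int) -> None:
    """Step 2: consume raw data and checkpoint the state with its offset."""
    total, count = 0.0, 0  # toy model state: a running mean
    for offset, value in raw.read_from(0):
        total += value
        count += 1
        if count % every == 0:
            out.append({"raw_offset": offset, "mean": total / count})


def regression_test(ckpts: Topic, out: Topic, lo: float, hi: float) -> None:
    """Step 3: promote only checkpoints that pass an (assumed) range check."""
    for _, ckpt in ckpts.read_from(0):
        if lo <= ckpt["mean"] <= hi:
            out.append(ckpt)


class Worker:
    """Step 4: a production worker that swaps in the newest promoted model."""

    def __init__(self) -> None:
        self.model = None

    def refresh(self, topic: Topic) -> None:
        latest = topic.latest()
        if latest is not None and latest != self.model:
            self.model = latest


for value in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:  # step 1: incoming data
    raw_data.append(value)

train(raw_data, checkpoints, CHECKPOINT_EVERY)
regression_test(checkpoints, best_model, lo=0.0, hi=10.0)

worker = Worker()
worker.refresh(best_model)
```

Because every checkpoint carries the "raw data" offset it was built from, any promoted model can be traced back to, and rebuilt from, an exact prefix of the log.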

I imagine you could add constraints about "model" continuity or gradual release to production that would make the process more complex, but I feel like fundamentally Kafka solves a lot of the distributed systems problems.