| HN Mirror



	by tomasdpinho 2620 days ago
	Containers are meant to be stateless infrastructure. By downloading something at startup, you're breaking that contract implicitly. Secondly, depending on where you're deploying, downloading from S3 (and then loading into memory) may take a non-negligible amount of time, which can hurt the availability of your pods (again, depending on their configuration). Making every call synchronous may cause request loss if your ML pipeline is not very reliable, which in most cases it isn't. Relying on a message queuing system also improves observability, because it's easier to expose metrics, log requests, replay past messages for debugging ("time travelling"), etc.

2 comments

maksimum 2620 days ago

> Containers are meant to be stateless infrastructure. By downloading something at startup, you're breaking that contract implicitly.

I feel that mounting an NFS partition is a similar break of contract, i.e. you could see the same image behave differently depending on what's in the NFS partition. To get data in a "reproducible" way, I feel you need to pull it from a data versioning system. I think there are different ways to implement data versioning, each with its own trade-offs. NFS and S3, among others, could be used to implement data versioning.

I agree with you that in theory NFS is more performant because it allows you to load lazily.


ethbro 2620 days ago

Curious about how you'd scale with data versioning.

For any kind of real-time, high-bandwidth feed, I feel like what you're suggesting isn't cost-effective for the benefits it provides.

If you need absolute reproducibility and back-testing, or your feed is lower bandwidth, it may make sense. But not for larger systems.


maksimum 2620 days ago

Interesting topic. :)

This is mainly relevant if your data is used for training.

It seems like you'd want to use a log-based system like Kafka to manage versioning and state in this case. I imagine you could:

1. Store incoming training data in a "raw data" topic.

2. A model trainer consumes incoming training data, updates a model's state, and at a pre-determined interval writes the model's state, as of a given offset in the "raw data" topic, to a "model state checkpoint" topic.

3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and upon success writes to a "latest best model" topic.

4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.

I imagine you could add constraints about "model" continuity or gradual release to production that would make the process more complex, but I feel like fundamentally Kafka solves a lot of the distributed systems problems.
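
A rough in-memory sketch of the four steps above, just to make the data flow concrete. It does not use Kafka or any Kafka client: `Topic` is a hypothetical stand-in for an append-only log, and `Checkpoint`, `train`, `promote`, `Worker`, and the running-mean "model" are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Topic:
    """Hypothetical stand-in for a log topic: an append-only list."""
    name: str
    log: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        self.log.append(record)
        return len(self.log) - 1  # offset of the record just written

    def read_from(self, offset: int) -> List[Any]:
        return self.log[offset:]


@dataclass
class Checkpoint:
    raw_offset: int  # last "raw data" offset folded into the model
    state: dict      # copy of the model state as of that offset


def train(raw: Topic, checkpoints: Topic, every: int) -> None:
    """Step 2: replay "raw data" from offset 0, checkpoint every `every` records."""
    state = {"count": 0, "total": 0.0}  # toy model: a running mean
    for offset, x in enumerate(raw.read_from(0)):
        state["count"] += 1
        state["total"] += x
        if (offset + 1) % every == 0:
            checkpoints.append(Checkpoint(offset, dict(state)))


def promote(checkpoints: Topic, best: Topic,
            passes: Callable[[dict], bool]) -> None:
    """Step 3: run a regression check and publish the checkpoints that pass."""
    for cp in checkpoints.read_from(0):
        if passes(cp.state):
            best.append(cp)


class Worker:
    """Step 4: follows "latest best model" and swaps its model on change."""

    def __init__(self, best: Topic) -> None:
        self.best = best
        self.next_offset = 0
        self.model: Optional[Checkpoint] = None

    def poll(self) -> Optional[Checkpoint]:
        new = self.best.read_from(self.next_offset)
        if new:
            self.model = new[-1]          # newest promoted checkpoint wins
            self.next_offset += len(new)  # remember how far we have read
        return self.model
```

Because every checkpoint carries the "raw data" offset it was built from, you can in principle rebuild any model version by replaying the log up to that offset, which is the reproducibility part of the argument.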


kamac 2620 days ago

> By downloading something at startup, you're breaking that contract implicitly.

Nitpicking here, but if you can ensure that a certain version is downloaded, then the contract isn't violated.
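
One way to "ensure a certain version" is to pin the artifact by content hash and refuse to start if the download doesn't match. A minimal sketch, assuming the expected SHA-256 digest is baked into the image or its config; `load_pinned`, `VersionMismatch`, and the `fetch` callable are hypothetical names, and the actual S3 or HTTP download is left out on purpose.

```python
import hashlib
from typing import Callable


class VersionMismatch(RuntimeError):
    """Raised when the downloaded bytes are not the pinned version."""


def load_pinned(fetch: Callable[[], bytes], expected_sha256: str) -> bytes:
    """Fetch an artifact and accept it only if its SHA-256 matches the pin."""
    data = fetch()  # hypothetical: whatever actually downloads the artifact
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise VersionMismatch(f"expected {expected_sha256}, got {actual}")
    return data
```

With the digest fixed at build time, the same image always ends up with the same bytes or fails loudly, regardless of what currently sits in the bucket.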
