by alando46 1043 days ago
The problem for our use case is that saving on GPUs is pointless if we have to keep paying egress fees for our 250 TB training dataset.

The single interface for any cloud GPU is cool, but hard to imagine it taking off without some additional features.

I think for lots of shops the hardest part isn't the compute but moving the data around. For example, we use S3, some Lustre caching, and spot instance nodegroups. We are a deep learning research team that spends roughly 40-50k/month on AWS compute for training jobs. I imagine this is somewhat mid-tier: maybe more than some, but certainly far less than others.
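A rough back-of-envelope for the egress point, as a sketch only. The 250 TB and 40-50k/month figures are from this comment; the per-GB egress rate and the GPU discount are placeholder assumptions, not quoted prices, so plug in your own numbers.

```python
# Back-of-envelope: how many months of GPU savings does one full
# copy of the dataset out of the current cloud eat up?

DATASET_TB = 250                         # from the comment
MONTHLY_COMPUTE_USD = (40_000, 50_000)   # from the comment
EGRESS_USD_PER_GB = 0.05                 # ASSUMPTION: placeholder flat rate
GPU_DISCOUNT = 0.30                      # ASSUMPTION: hypothetical savings on compute


def egress_cost_usd(dataset_tb: float, usd_per_gb: float) -> float:
    """Cost of moving the whole dataset out once (1 TB taken as 1000 GB)."""
    return dataset_tb * 1000 * usd_per_gb


def months_to_break_even(egress_usd: float, monthly_compute_usd: float,
                         discount: float) -> float:
    """Months of discounted compute needed to pay back one full transfer."""
    return egress_usd / (monthly_compute_usd * discount)


one_copy = egress_cost_usd(DATASET_TB, EGRESS_USD_PER_GB)
for spend in MONTHLY_COMPUTE_USD:
    months = months_to_break_even(one_copy, spend, GPU_DISCOUNT)
    print(f"${spend:,}/month compute: one full copy costs ${one_copy:,.0f}, "
          f"about {months:.1f} months of savings")
```

A one-time copy might be tolerable under these assumptions, but if the data has to be pulled across repeatedly (new data, re-syncs, streaming per job), the cost recurs and can cancel out the GPU savings.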

For inference, data egress costs could be less of an issue, but your service would really need to be almost invisible. It probably would be pretty complicated for a number of reasons, but if you could design a "virtual on-demand nodegroup"™ that I could add to my existing clusters and then map to whatever k8s stuff I want, that would probably be useful. I would need to be able to auto-deploy a base image to the machine, then provision the node and register it with my cluster.

Just some unorganized thoughts. Good luck and have fun.

1 comment

What is your use case that justifies a cost of 40-50k/month just for training?