| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dark-star 1369 days ago
	I'd still consider it as "performance issue", not "reliability issue". There is no service unavailability here. It just takes your system a minute longer until the target GPU capacity is available. Until then it runs on fewer GPU resources, which makes it slower. Hence performance. The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just re-try the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached) but reliability? not really

1 comments

irrational 1369 days ago

I’d like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?

link

campers 1369 days ago

With GCP you can right-size the CPU and memory of the VM the GPU is attached to, unlike the fixed GPU AWS instances, so there is the potential for cost savings there.

link