by sillysaurusx
1054 days ago
It’s not euw4a. It’s everywhere. The allocation algorithm across the board kills off TPUs after no more than a couple of hours. usc1f, usc1a, usc1c, euw4a; it makes no difference. It would be funny if someone had set up gpt-2-15b-poetry (our project) in some special way to prevent us from making TPUs that ever last more than a few hours, but from what I’ve heard from other people, this isn’t the case. That’s what I mean about the left hand not knowing what the right hand is doing. It’s not a misconfiguration. Again, pretend to be some random person who just wants to apply for TPU access, fill out the form, then try to do research with the TPUs that are available to you. You’ll have a rough time, but it’ll also cure the misconception that this is a special case or that it was just me. Again, no need to take my word for it; here’s an organic comment from someone who was rolling their eyes whenever I was cheerleading TRC, because their experience was so bad: https://news.ycombinator.com/item?id=36936782 I think that the experience is probably great for researchers who get special approval. And that’s fine, if that’s how the program is designed to be. But at least tell people that they shouldn’t expect more than an hour or two of TPU time.
By default, the TRC program grants both on-demand quota and preemptible quota. If you are able to create a TPU VM with your on-demand quota, it should last quite a bit longer than a few hours. (There are situations in which on-demand TRC TPU VMs can be interrupted, but these ought to be rare.) If your on-demand TPU VMs are being interrupted frequently, please email TRC support and provide the names of the TPU hosts that were interrupted so folks can try to help.
When there is very high demand for Cloud TPUs, it's certainly possible for preemptible TPU VMs to be interrupted frequently. It would be an interesting engineering project to make a very robust training system that could make progress even with low TPU VM uptime, and I hope someone does it! Until then, though, you should have a better experience with on-demand resources when you're able to create them. Reserved capacity is even better since it provides an expectation of both availability and uptime.
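As a rough illustration of what "making progress even with low uptime" could look like, here is a minimal sketch of a checkpoint-and-resume loop. It is not anything TRC or Cloud TPU ships; the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state, and the `interrupt_at` parameter are all hypothetical and only there to simulate a preemption. It also assumes the checkpoint path lives on storage that survives the VM, which a local disk generally would not.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, interrupt_at=None):
    """Run (or resume) training; interrupt_at simulates a preemption (hypothetical)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

With this shape, an interruption costs at most `checkpoint_every` steps of work, so the interval is the knob to tune against how often the VM gets preempted and how expensive a checkpoint write is.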