Hacker News new | ask | show | jobs
by sillysaurusx 1054 days ago
It’s not euw4a. It’s everywhere. The allocation algorithm across the board kills off TPUs after no more than a couple hours. usc1f, usc1a, usc1c, euw4a; it makes no difference.

It would be funny if someone set gpt-2-15b-poetry (our project) in some special way to prevent us from making TPUs that ever last more than a few hours, but from what I’ve heard from other people, this isn’t the case. That’s what I mean about the left hand doesn’t know what’s going on with the right hand. It’s not a misconfiguration. Again, pretend to be some random person who just wants to apply for TPU access, fill out your form, then try to do research with the TPUs that are available to you. You’ll have a rough time, but it’ll also cure this misconception that it’s a special case or was just me.

Again, no need to take my word for it; here’s an organic comment from someone who was rolling their eyes whenever I was cheerleading TRC, because their experience was so bad: https://news.ycombinator.com/item?id=36936782

I think that the experience is probably great for researchers who get special approval. And that’s fine, if that’s how the program is designed to be. But at least tell people that they shouldn’t expect more than an hour or two of TPU time.

1 comments

It sounds like you're primarily using preemptible TPU quota, which doesn't come with any availability or uptime expectations at all.

By default, the TRC program grants both on-demand quota and preemptible quota. If you are able to create a TPU VM with your on-demand quota, it should last quite a bit longer than a few hours. (There are situations in which on-demand TRC TPU VMs can be interrupted, but these ought to be rare.) If your on-demand TPU VMs are being interrupted frequently, please email TRC support and provide the names of the TPU hosts that were interrupted so folks can try to help.

When there is very high demand for Cloud TPUs, it's certainly possible for preemptible TPU VMs to be interrupted frequently. It would be an interesting engineering project to make a very robust training system that could make progress even with low TPU VM uptime, and I hope someone does it! Until then, though, you should have a better experience with on-demand resources when you're able to create them. Reserved capacity is even better since it provides an expectation of both availability and uptime.

I was using on-demand TPUs primarily, and preemptible TPUs secondarily. Neither would last more than an hour or two. And two was something of a minor miracle by the end.
For future reference, the team looked into this, and it appears that the interruptions you experienced were specific to your project and a small number of other projects. The vast majority of TRC projects should see much longer Cloud TPU uptimes when they are able to create on-demand TPUs.

I'm sorry that you had such a frustrating time and that we weren't able to sort it out via email while it was happening. If you decide to try TRC again and run into issues like this, please be sure to engage with TRC support!