Hacker News new | ask | show | jobs
by zak 1050 days ago
A few quick comments:

> But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem

Totally agree! This was a big part of my original motivation for creating the TPU Research Cloud program. People sometimes assume that e.g. an academic affiliation is required to participate, but that isn't true; we want the program to be as open as possible. We should find a better way to highlight the work of TRC tinkerers - for now, the GitHub and Hugging Face search buttons near the top of https://sites.research.google/trc/publications/ provide some raw pointers.

I'm sorry to hear that you've personally had a hard time getting TPU v3 capacity in europe-west4-a. In general, TRC TPU availability varies by region and by hardware generation, and we've experimented with different ways of prioritizing projects. It's possible that something was misconfigured on our end if your TPU lifetimes were so short. Could you email Jonathan the name of the project(s) you were using and any other data you still have handy so we can figure out what was going wrong?

Also, thanks for the kind words for Jonathan and the rest of the TRC team. They haven't lost any power or control, and they are allocating a lot more Cloud TPU capacity than ever. However, now that everyone wants to train LLMs, diffusion models, and other exciting new things, demand for TPU compute is way up, so juggling all of the inbound TRC requests is definitely more challenging than it used to be.

1 comments

It’s not euw4a. It’s everywhere. The allocation algorithm across the board kills off TPUs after no more than a couple hours. usc1f, usc1a, usc1c, euw4a; it makes no difference.

It would be funny if someone set gpt-2-15b-poetry (our project) in some special way to prevent us from making TPUs that ever last more than a few hours, but from what I’ve heard from other people, this isn’t the case. That’s what I mean about the left hand doesn’t know what’s going on with the right hand. It’s not a misconfiguration. Again, pretend to be some random person who just wants to apply for TPU access, fill out your form, then try to do research with the TPUs that are available to you. You’ll have a rough time, but it’ll also cure this misconception that it’s a special case or was just me.

Again, no need to take my word for it; here’s an organic comment from someone who was rolling their eyes whenever I was cheerleading TRC, because their experience was so bad: https://news.ycombinator.com/item?id=36936782

I think that the experience is probably great for researchers who get special approval. And that’s fine, if that’s how the program is designed to be. But at least tell people that they shouldn’t expect more than an hour or two of TPU time.

It sounds like you're primarily using preemptible TPU quota, which doesn't come with any availability or uptime expectations at all.

By default, the TRC program grants both on-demand quota and preemptible quota. If you are able to create a TPU VM with your on-demand quota, it should last quite a bit longer than a few hours. (There are situations in which on-demand TRC TPU VMs can be interrupted, but these ought to be rare.) If your on-demand TPU VMs are being interrupted frequently, please email TRC support and provide the names of the TPU hosts that were interrupted so folks can try to help.

When there is very high demand for Cloud TPUs, it's certainly possible for preemptible TPU VMs to be interrupted frequently. It would be an interesting engineering project to make a very robust training system that could make progress even with low TPU VM uptime, and I hope someone does it! Until then, though, you should have a better experience with on-demand resources when you're able to create them. Reserved capacity is even better since it provides an expectation of both availability and uptime.

I was using on-demand TPUs primarily, and preemptible TPUs secondarily. Neither would last more than an hour or two. And two was something of a minor miracle by the end.
For future reference, the team looked into this, and it appears that the interruptions you experienced were specific to your project and a small number of other projects. The vast majority of TRC projects should see much longer Cloud TPU uptimes when they are able to create on-demand TPUs.

I'm sorry that you had such a frustrating time and that we weren't able to sort it out via email while it was happening. If you decide to try TRC again and run into issues like this, please be sure to engage with TRC support!