| Zak, I love you buddy, but you should have some of your researchers try to use the TRC program. They should pretend to be a nobody (like I was in 2019) and try to do any research with the resources they’re granted. I guarantee you those researchers will all tell you “we can’t start any training runs anymore because the TPUs die after 45 minutes.” This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half. You don’t need to take my word for it. Here are some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying. I held out hope for so long. I thought it was temporary. It ain’t temporary, Zak. And I vividly remember when it happened. Some smart person at Google proposed a new allocation algorithm back near the end of 2021, and poof, overnight the number of TPUs we could create went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up. I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve them, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data. Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself (aka preempt) after no more than a couple hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it. 
There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes. One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes. Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions. |
> But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem
Totally agree! This was a big part of my original motivation for creating the TPU Research Cloud program. People sometimes assume that e.g. an academic affiliation is required to participate, but that isn't true; we want the program to be as open as possible. We should find a better way to highlight the work of TRC tinkerers - for now, the GitHub and Hugging Face search buttons near the top of https://sites.research.google/trc/publications/ provide some raw pointers.
I'm sorry to hear that you've personally had a hard time getting TPU v3 capacity in europe-west4-a. In general, TRC TPU availability varies by region and by hardware generation, and we've experimented with different ways of prioritizing projects. It's possible that something was misconfigured on our end if your TPU lifetimes were so short. Could you email Jonathan the name of the project(s) you were using and any other data you still have handy so we can figure out what was going wrong?
Also, thanks for the kind words for Jonathan and the rest of the TRC team. They haven't lost any power or control, and they are allocating a lot more Cloud TPU capacity than ever. However, now that everyone wants to train LLMs, diffusion models, and other exciting new things, demand for TPU compute is way up, so juggling all of the inbound TRC requests is definitely more challenging than it used to be.