by nl 1055 days ago
> You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs

> Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.

Unless I'm misreading this, they sound pretty happy and you sound pessimistic? Their last substantial comment was "I'm sure Zak could hook you up with something better"?


TRC is supposed to be the “something better”. This insider TPU stuff is for the birds. If TRC can only offer 4 hours with no preemptions, that’s fine, but they need to be up front about that. Saying that TPUs preempt every 24 hours and then killing them off after 45 minutes is… not very productive.

As for their comments, the third screenshot is the key; they’re agreeing that the situation is bad. They’re a friend, and they’re a little indirect with the way they phrase things. (If you’ve ever had a friend who really doesn’t want to be wrong, you know what I mean; they kind of say things in a circular way in order to agree without agreeing. After a while it’s pretty cute and endearing though.)

I was particularly pessimistic in those DMs because it came a couple months after I thought I’d give TRC one last try, back in January, which was roughly a year after I’d started my “ok, I’m losing hope, but I’ll wait and see” journey. In the meantime I kept cheerleading TRC and driving people to their signup page. But after the TPUs all died in less than two hours yet again, that was that.

I have a really high tolerance for faulty equipment. This is free compute; me complaining is just ungrateful. But I saw what things were like in 2019. “Different” would be the understatement of the century. If my baby wasn’t being incubated in the NICU today, I’d show the charts where our usage went from thousands of cores down to almost zero, and not for lack of trying.

It also would’ve been fine to say “sorry, this is unsustainable, the new limits are one TPU per person per project” and then give me a rock-solid TPU. We had those in 2021. One of our TPUv3s stayed online for so long that I started to host my blog on it just to show people that TPUs were good for more than AI; the uptime was measured in months. Then poof, now you can barely fire one up.

I don't have a qualified opinion on the subject of TPU availability.

I'm just pointing out that your summary of the DMs ("Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying") is the opposite of what the DMs show.

As mentioned in another comment, it sounds like you're using preemptible TRC TPU quota. If you use on-demand TRC TPU quota instead, that should improve your uptime substantially.

This is totally fascinating.

Frankly, it sounds to me like they're having severe yield+reliability problems with the TPUv4s that aren't getting caught by wafer-level testing, and have binned the flakiest ones for use by outsiders.

A lot of yield issues show up as spontaneous resets/crashes.

It's more likely that Google is preempting researchers who are on a preemptable research grant, and it is happening a lot more often because there are more paying customers.

"Preemptable money" sounds like the kind of bullshit I would use to cover up failed chips. And yes, I am a VLSI engineer.