by blensor 13 days ago
I also feel that the GPU/NPU value does not lose money as fast anymore.

What I am wondering, though, is how long you can run such a system at basically full load without interruption before it starts to physically degrade.

If I have an H100 and let it run for 4 years at full throttle, does it still have the same theoretical value it had at the start, or are the chips just burning out?

I think I remember that back when the cards used for crypto mining were sold en masse on eBay, the advice was to stay away from them because they were more likely to fail.

3 comments

Quite the opposite: GPUs running at a stable rate degrade less than GPUs that continuously hit highs and lows (as happens on a gaming rig).
Normal use means loading data into the GPU for each batch. The load is not even, though training might be worse than "production".
After digging around a bit, I found an unverified claim from 2024 that GPUs in datacenters have a lifespan of 1-3 years:

https://www.tomshardware.com/pc-components/gpus/datacenter-g...

Others say that moderate load means a lifespan of ~5 years.

I'm not sure exactly what that means, but I would assume a datacenter starts replacing a node once its error rate hits a certain threshold, without really investigating why it failed. So the practical lifespan may be shorter than 5 years even if the hardware would technically still be usable.

https://en.wikipedia.org/wiki/Electromigration

Temperature is a big factor, as well as current density.

But there's also the number and magnitude of thermal cycles (which translate into mechanical stress, leading to metal-fatigue-like effects on contact points etc.), attack from chemicals in the air, cosmic radiation, ESD damage, and more. Some may matter, some not.

That's why "new" > "used" in case of electronics. Especially since you don't know the (ab)use history of used parts.
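
For the temperature and current-density point above, the usual textbook model for electromigration is Black's equation, MTTF = A / J^n * exp(Ea / (k*T)). Here is a rough Python sketch of how lifetime scales relative to a reference operating point. The function name, the 60 °C reference, the activation energy of 0.8 eV, and the exponent n = 2 are illustrative assumptions on my part, not measured values for any real GPU.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(temp_c, current_density_ratio=1.0, ref_temp_c=60.0,
                  activation_energy_ev=0.8, current_exponent=2.0):
    """Electromigration MTTF relative to a reference point (Black's equation).

    All defaults are assumed, illustrative values. The unknown constant A
    cancels out because only a ratio of two lifetimes is computed.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    # Thermal term: exp(Ea/(k*T)) divided by exp(Ea/(k*T_ref))
    thermal = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_temp_k)
    )
    # Current-density term: (J / J_ref)^(-n)
    current = current_density_ratio ** (-current_exponent)
    return thermal * current


if __name__ == "__main__":
    for temp_c in (60, 70, 80, 90):
        print(f"{temp_c} C: {relative_mttf(temp_c):.2f}x the reference lifetime")
```

Under these assumptions the lifetime drops exponentially with junction temperature and with the square of current density, so a card held hot at high clocks ages much faster than the same card at a cooler, lower-power operating point. It only covers electromigration, not the thermal-cycling or other effects listed above.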

> I also feel that the GPU/NPU value does not lose money as fast anymore.

That's because the rate of improvement in silicon manufacturing has been continually declining for a few decades, which has a compounding effect. Just compare the technological improvements in successive decades. 1976->1986->1996->2006->2016->2026.

That's why "in real terms" performance has only been improving very slowly if you compare apples to apples (and not e.g. apples to oranges by reducing precision, like Nvidia tends to do, or by comparing chips with x W to an MCM with x*2 W and saying the latter is much faster). The "just halve the number of bits in each generation" strategy has also run out now: there are no more bits to halve.