by topspin 301 days ago
As there is ongoing drama with Zen 5 and power issues, there are people with the instruments and the motivation to investigate this. You should consider contacting Gamers Nexus, and help them to get your test suite running. They can measure power draw and do a thermal analysis of this CPU, and they'd likely be eager to do it, given the possibility of making a bunch of dramatic YouTube content about design flaws in widely used hardware. That's pretty much their whole schtick in recent years.

> Modern CPUs measure their temperature and clock down if they get too hot, don't they?

Yes. It's rather complex now, and it involves the motherboard vendor's firmware. When (not if) they get that wrong, CPUs burn up. You're going to need some expertise to analyze this.


> [...] a bunch of dramatic YouTube content [...]

That framing doesn't do him and the team justice. There is (or rather, was) a 3.5-hour story about NVIDIA GPUs finding their way illegally from the US to China, which got taken down by a malicious DMCA claim from Bloomberg. It is quite interesting to watch (a copy can be found on archive.org).

GN is one of the last pro-consumer outlets that keep digging and shaking the tree big companies are sitting on.

For the record, I think GN is excellent and highly credible.
That copy is missing the chapters. Here they are:

00:00:00 - The NVIDIA AI GPU Black Market

00:06:06 - WE NEED YOUR HELP

00:07:41 - A BIG ADVENTURE

00:10:10 - Ignored by the US

00:11:46 - BACKGROUND: Why They're Banned

00:16:04 - TIMELINE

00:21:32 - H20 15 Percent Revenue Share with the US

00:26:01 - Calculating BANNED GPUs

00:29:31 - OUR INFORMANTS

00:31:47 - THE SMUGGLING PIPELINE

00:33:39 - PART 1: HONG KONG Demand Drivers

00:43:14 - PART 1: How Do Suppliers Get the GPUs?

00:48:18 - PART 1: GPU Rich and GPU Poor

00:56:19 - PART 1: DATACENTER with Banned GPUs, AMD, Intel

01:06:19 - PART 1: Chinese Military, Huawei GPUs

01:09:48 - PART 1: How China Circumvents the Ban

01:19:30 - PART 1: GPU MARKET in Hong Kong

01:32:39 - WIRING MONEY TO CHINA

01:36:29 - PART 2: CHINA Smuggling Process

01:43:26 - PART 3: SHENZHEN's GPU MIDDLEMEN

01:50:22 - PART 3: AMD and INTEL GPUs Unwanted

01:56:34 - PART 4: THE GPU FENCE

02:06:01 - PART 4: FINDING the GPUs

02:15:12 - PART 4: THE FIXER IC Supplier

02:21:12 - PART 5: GPU WAREHOUSE

02:27:17 - PART 6: CHOP SHOP and REPAIR

02:34:52 - PART 6: BUILD a Custom AI GPU

02:56:33 - PART 7: FACTORY

03:01:01 - PART 8: TAIWAN and SINGAPORE Intermediaries

03:02:06 - PART 9: SMUGGLER

03:05:11 - LEGALITY of Buying and Selling

03:08:05 - CORRUPTION: NVIDIA and Governments

03:26:51 - SIGNOFF

Given that Gamers Nexus needs the ad revenue, wouldn't linking to a re-upload that gives them none of it be sort of bad?
While they can't publish it themselves, this at least achieves the goal of spreading the information, along with the knowledge that it was their investigative team that did the work in the first place.

But yes, once they reedit and republish themselves (or manage some sort of appeal and republish as-is) then of course linking to that (and a smaller cut of the parts they've had to change because Bloomberg were litigious arseholes, if only to highlight that their copyright claim here is somewhat ridiculous) would be much better.

It sounds like their lawyers have done the appropriate counter-challenge with YouTube, so the video will go back up unless Bloomberg sues them in the next so many days. And this is Gamers Nexus, so I presume they will fight to keep it as is on principle.

Personally, I found the length of the quotes from politicians kind of tedious, but I sure wouldn’t want them to capitulate to Bloomberg after this.

They do not "need" it, since that movie was crowdfunded with over 400k anyway and AdSense revenue is a pittance in comparison. They have also indirectly promoted that re-upload.
YouTube ad revenue isn't as high as you'd think. A very significant part of their income comes from in-video sponsors and merchandise sales.
>Given that Gamers Nexus needs the ad revenue

They made six figures from merch sales on that investigation. Not much, but more than YouTube ads.

Agreed. I'm waiting until they get it back up to watch it. I can wait.
buy a mug from them to support them!
When something is uploaded to the internet, it won't be easy to take it down.

Ask Beyonce.

Or Barbra Streisand
Can you explain further about Beyoncé? Do you mean the elevator video where her sister attacks Jay Z?
Haha I love that one. She should have just leaned into it and laughed.
There is a picture on the internet that Beyoncé and her lawyers don't like. They tried to remove it from the internet.

You guess the result.

The small coolers they used are not recommended by Noctua for the 9950X. Noctua recommends only bigger coolers for the 9950X, which dissipates 200 W continuously on a workload like theirs (much less than the more than 250 W dissipated in similar conditions by the competing Intel CPUs).

Despite this, the over-temperature protection should have protected the CPUs and prevented any damage like this.

Besides the system that continuously varies the clock frequency to keep the CPU within its current and power limits, there is a second protection that temporarily stops the clock when a temperature threshold is exceeded. However, the internal temperature sensors of the CPUs are not accurate, so the over-temperature protection may begin to act only at a temperature that is already too high.
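To make the sensor-accuracy point concrete, here is a toy sketch in Python. It is not a model of AMD's firmware: `first_trip_temperature` is a made-up helper, and the 95 °C threshold and the 10 °C under-read are illustrative assumptions, not published specifications. It only shows that a protection keyed to the *reported* temperature fires late by exactly the amount the sensor under-reads.

```python
def first_trip_temperature(sensor_error_c, trip_threshold_c=95.0,
                           start_c=60.0, step_c=0.5):
    """Return the TRUE die temperature at which the clock-stop first fires.

    sensor_error_c > 0 means the sensor under-reads by that many degrees.
    All numbers are illustrative assumptions, not vendor specifications.
    """
    true_c = start_c
    # The protection only ever sees the reported value (true minus error).
    while true_c - sensor_error_c < trip_threshold_c:
        true_c += step_c  # the die keeps heating until the trip fires
    return true_c


# An accurate sensor trips right at the threshold ...
accurate = first_trip_temperature(sensor_error_c=0.0)
# ... while a sensor that under-reads by 10 degrees lets the die run 10 degrees hotter.
under_reading = first_trip_temperature(sensor_error_c=10.0)
print(accurate, under_reading)
```

In this toy setup the true trip points come out as 95.0 and 105.0, so the second case is already past the assumed limit before the protection reacts.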

So these failures appear to have been caused by a combination of factors: not using appropriate coolers for a 200 W CPU; AMD advertising a 200 W CPU as a 170 W CPU, which fools naive customers into believing that smaller coolers are acceptable; and either some kind of malfunction of the over-temperature protection in these CPUs or a degradation problem that occurs even within the nominal temperature range, at its upper end.

> The small coolers used by them are not recommended by Noctua for 9950X

Noctua's CPU compatibility page lists the NH-U9S as "medium turbo/overclocking headroom" for the 9950X [0]. I don't think it's fair to suggest their cooler choice is the problem here.

[0] https://ncc.noctua.at/cpus/model/AMD-Ryzen-9-9950X-1831

That means pretty much "not recommended".

On the page you linked, Noctua explains that the green check mark means that with that cooler the CPU can run all-core intensive tasks, exactly like those used by the gmplib developers, only at the base clock, which is 4.3 GHz for the 9950X, with turbo disabled in BIOS.

Only then might the CPU dissipate its nominal TDP of 170 W, instead of the 200 W that it dissipates with turbo enabled.

With "best turbo headroom", you can be certain that the CPU can run all-core intensive tasks with turbo enabled. Even if you do no overclocking, but you run all-core intensive tasks with turbo enabled, this is the kind of cooler that you need.

Noctua does not define what "medium headroom" means, but presumably it means that you can run with turbo enabled all-core tasks that have medium intensity, not maximum intensity.

There is no doubt that it is a mistake to choose such a cooler when you intend to run intensive multi-threaded computations. A better cooler that is not much bigger, like the NH-U12A, has almost double the cooling capacity.

That said, there is also no doubt that AMD is guilty of at least having some bugs in their firmware or in failing to provide adequate documentation for the motherboard manufacturers that adapt the AMD firmware for their MBs.

It is important to remember that CPUs scale their turbo with thermals. It's not a matter of needing to turn turbo on and off.
Wendell at Level1Techs often goes more in-depth on the software testing and datacenter use-case analysis through partnerships with friends who run lots of machines in datacenters.

GN is unique in paying for silicon-level analysis of failures.

der8auer also contributes a lot to these stories.

I tend to wait for all 3 of their analyses, because each adds a different "hard-won" perspective.

He's a bit sensationalist, yes, but I am thankful that he saved us from buying affected Intel CPUs.
He's a "student" and friend of the late Gordon Mah Ung. He's carrying his torch forward.

This was Gordon's style, and Steve is continuing it. He has the courage to hit Bloomberg offices with a cameraman, so I don't think his words ring hollow.

We need that kind of in-your-face, no-punches-pulled reporting rather than the style of "measured professionals".

Absolutely - this is the sort of direct citizen journalism I expect (sort of hope?) we'll see more and more of as traditional investigative journalism dies its slow death.
Yes. When he's right, he's right. However the main issue I have with GN is how Steve tends to go full Leeroy Jenkins pitchforks and torches for 9 out of every 5 actual scandals in the tech industry.
When it comes to interpersonal drama, the "Shoot first, ask questions later" style of reporting is terrible. However, for consumer advocacy it's basically the opposite, especially because in most cases it's easy for companies to turn the narrative around by simply handling the issue well. It's almost more about how they handle it than the actual issue in many cases.
I felt the same way, but over time I have come to respect those with the Crusader personality archetype. We need these people to do their thing, and they need us to balance them out.
Not sure if he's sensationalist or just doing great reporting. I see him as one of the last good tech journalists on the platform.
GN wasn't the first to break the story that 13th/14th gen was defective. The thousands and thousands of users experiencing the issues collectively noticed pretty quickly. If anything, there was a period where he was saying "We've talked to Intel but we won't say anything yet until they do."
AMD has failed to be reliable with its Zen 4 and Zen 5 consumer CPUs, just at the same time Intel did the same with their 13k and 14k higher end CPUs.

AMD is somewhat worse than Intel as their DDR5 memory bus is very "twitchy" making it hard to get the highest DDR5 timings, especially with multiple DIMMs per channel.

I don't think it's reasonable to call memory timing tweaking stability issues worse than a cpu dying from heat under normal usage.
I had to put together an AM5 computer pretty quickly after I accidentally fried some components in my last computer, so I got a Microcenter bundle.

I got 2x32GB sticks of RAM with the plan to throw in another two sticks later. I had no idea that was now a bad plan. I wish manufacturers would have just put 2 DIMM slots on motherboards as a “warning.”

I think that's just a result of being at the limit of what a right-angle memory slot can handle. It's about time that desktops move to CAMM or soldered memory.
What do you mean? Is your second sentence the only reason for the first?
They don't say what temperature the CPU was reporting, which seems like an odd omission. Whatever the specs of your cooler, check the temperature it's actually running at. Go by what the CPU is saying! I've got the older 3950X, and the first one died after a few months (still in warranty) with an in-spec cooler, but it would go into the 90s at full load just doing big builds. I replaced the heatsink with a basic water cooler when the replacement chip arrived, and it's running at least 20 °C cooler at full load.
A modern CPU should be able to detect temperature excursions and bring itself to a safe halt even if you power it up without any cooler attached. It's normal and expected that people making mistakes around the cooling systems of their CPUs will accidentally give themselves terrible performance. It is not normal that the CPUs will break.
Zen 2 is supposed to be able to work up to 95 C so that shouldn't have caused your CPU to fail. And it should clock down before it fails anyway, way below the specified "minimum" frequency if needed - got to experience that with a failing AIO. A better cooler should only be required to make full use of your CPU not to protect it.
I kind of agree with you and Symmetry, but having had a fried CPU I'm more careful. No electronics like running very hot, so even if you're just inside spec on something for the heat, it's likely to live a shorter life than if you kept it more comfortable. And it'll clock faster if you keep it cool! Really, my points are:

* The standard spec coolers just don't manage that on these hot CPUs, even if they claim to.

* If you're building a machine and you know you're pushing it hard, just check the temperatures to confirm that the cooling you bought is working.
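For anyone who wants to actually check, here is a minimal sketch in Python, assuming a Linux box that exposes sensors through the kernel's hwmon sysfs interface (`/sys/class/hwmon`, values in millidegrees Celsius). On AMD systems the `k10temp` driver usually provides labels like `Tctl`, but which chips and labels show up depends on your kernel and board, so treat those names as assumptions. `read_hwmon_temps` is just a helper name I made up.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {(chip, label): degrees_C} for every temp*_input under base."""
    temps = {}
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver exposes no name file.
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            label_file = chip_dir / temp_file.name.replace("_input", "_label")
            label = (label_file.read_text().strip()
                     if label_file.exists() else temp_file.name)
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read; skip them
            temps[(chip, label)] = millideg / 1000.0
    return temps


# Example use while the machine is under full load:
#   for (chip, label), deg in read_hwmon_temps().items():
#       print(f"{chip:12s} {label:14s} {deg:6.1f} C")
```

Run it in a loop while your real workload is going, and log the output somewhere that survives a crash. That way you have numbers to look at if the chip dies later.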
Maybe they didn't have anything logging the temperature. They didn't expect it to die after all.
All you really need to see is the picture of the CPU with thermal paste on only one half. Thermal throttling is tuned to work when (1) the heatsink is sufficient (theirs was significantly below requirements) and (2) it is installed correctly, so that the triggers for downclocking fire with the correct timing. This is just another ridiculous instance of PEBCAK.
This is by design. On AM5 processors, there's a hotspot on the lower half of the processor where the dies that contain the CPU cores are located. Noctua recommends that AM5 users mount their coolers shifted towards the lower side of the processor for optimal cooling performance, see https://noctua.at/en/offset-am5-mounting-technical-backgroun... . You may have missed the paragraph in the article that explicitly points this out:

> We use a Noctua cooling solution for both systems. For the 1st system, we mounted the heat sink centred. For the 2nd system, we followed Noctua's advice of mounting things offset towards what they claim to be the hotter side of the CPU. Below is a picture of the 2nd system without the heat sink which shows that offset. Note the brackets and their pins, those pins are where the heat sink's pressure gets centred. Also note how the thermal paste has been squeezed away from that part, but is quite thick towards the left.

While it is Noctua's advice, I don't think AMD supports that view, so it would seem correct to at least test the CPU the way the vendor recommends before drawing conclusions.
You may have missed the part in the article that says that they only switched to offset mounting after their first Ryzen 9950X died when the cooler was mounted centered.

> But note that the 1st failure happened with a more centred heat sink. We only made the off-centre mounting for the 2nd system as to minimise the risk of a repeated system failure.

Noctua recommends mounting their cooler so that the center is shifted toward the lower part of the CPU. From your picture with the thermal paste, it’s clear that your cooler is only making contact with about two-thirds of the CPU, meaning you mounted it incorrectly. The cooler’s contact area must always cover the entire CPU; otherwise, you reduce heat transfer capacity. On top of that, you’re already using an undersized cooler for this CPU. I think you don’t understand the basics of thermodynamics.
Welcome to Hacker News! I'm glad my comment encouraged you to join the site.

I didn't write the article, I was just commenting because other users seemed to miss information that was written in it.

The picture with the thermal paste shows that paste was squeezed out from the entire perimeter of the CPU, so the cooler is making contact with the whole CPU. The paste is squeezed thinner near the lower side of the CPU because that's where the mounting pins are located, meaning that's where the mounting pressure is the strongest. The impression left by the thermal paste matches the diagram on Noctua's site ( https://noctua.at/pub/media/wysiwyg/offset/heat_cooler_base_... ).

Noctua lists the NH-U9S cooler as being compatible with the 9950X, and claims it has "medium turbo/overclocking headroom", see https://ncc.noctua.at/cpus/model/AMD-Ryzen-9-9950X-1831 . I'm not sure how they come up with their compatibility ratings, but I generally trust Noctua knows what they're doing when it comes to CPU cooling.

It's also important to note that the author only tried the offset mount after they had a CPU die when the cooler was mounted centered on the CPU.

Overall, I think it's unlikely that these failures can be blamed on poor cooling.

I'm not sure what image you're looking at, but the picture in question here most certainly shows a CPU that did not have a properly mounted heatsink (to a very severe degree).
Clearly paste was squeezed out from the entire perimeter of the CPU. Offset mounting is used intentionally for this CPU.

Probably there's less paste remaining on the south end of the CPU because that's where the mounting force is greatest.

If anything, there's too much paste remaining on the center/north end of the CPU. Paste exists simply to bridge the roughness of the two metal surfaces; too much paste is a bad sign.

My guess is that the MB was oriented vertically and that big heavy heat sink with the large lever arm pulled it away from the center and north side of the CPU.

IMO, the CPU is still responsible for managing its power usage to live a long life. The only effect of an imperfect thermal solution ought to be proportionally reduced performance.

Many reviewers have tested this and found that too much paste is not an issue, apart from being messy to clean.
The experiments I've seen comparing different pastes and application methods show only a 1-2 °C difference. Enthusiasts might care a lot about that, but most people wouldn't notice.
I'm not as sure about AMD CPUs (and they were known for having far worse overheat behaviour back in the early 2000s) but there are plenty of stories of Intel CPUs working for many years, sitting at the thermal limits, with the (stock) heatsink not even in contact, thanks to their cheap push-pin retention mechanism.
Those dreadful plastic knobs never want to sit right. Simple lever over that shit any time of day, pls.