by Joel_Mckay 705 days ago
Sounds like bad ram (clean contacts, re-seat, and test) or temperature issues (the main reason we still use mobile i7-12700H was cheap ddr4 64GB ram stick kit, Iris media gpu drivers, and rtx CUDA gpu.)

Intel has its own issues, Gigabyte told me to pound sand when asking to unlock the bios on my own equipment to disable IME.

There is no greener grass on the fence line... just a different set of issues =3


>Sounds like bad ram (clean contacts, re-seat, and test)

Since he's talking about iGPU issues, he most likely has a laptop APU, so no RAM to reseat. I'm also having similar issues on my Ryzen 7000 laptop. Kinda regret upgrading from the Ryzen 5000 laptop which AMD obsoleted just 2 years after I bought it, as at least that had no issues. Hopefully new drivers in the future will fix stability but you never know.

What I do know, is that this will most likely be my last AMD machine if Intel shows improvement to match AMD, since their Linux driver support is just top notch.

Desktop Ryzen 7950X.

Increasing the VRAM size (UMA size) to 4 GB fixed the frequent driver timeouts for me.

Reverting to an older driver (driver cleaner -> driver v23.11.1) fixed the memory leak. This memory leak is weird since PoolMon doesn't show anything unusual. Nothing shows as using too much memory anywhere, except committed memory size grows to over 100GB after a few days of uptime and RamMap shows a large amount of unused-active memory.

GPUs have the most complex drivers in the whole system, we're talking tens of millions of lines of code, so it is absolutely not surprising that you're having issues like that given how recent AMD's investment into APUs is. I wouldn't use them for a few more years; get a cheap discrete GPU from nvidia or maybe even from Intel.
Hm? AMD's investing in APUs is not a new thing, that's going back to the FX days with their FM1 socket. Since Ryzen 1 they have their G APUs, and their integrated graphics power the steamdeck and many other mobile handhelds. Plus, Intel's integrated graphics are known for their driver issues (and so is Arc, for now), so I'd disagree with that recommendation.
APU is not only not a new thing, it’s a marketing term AMD themselves invented over 10 years ago pushing the entire concept of having an iGPU.
The rtx3090 is an Ampere gpu, and will apparently be supported in the new open nVidia driver release.

Should get interesting soon =)

In Nova? Or just the in-kernel component?
Press release:

https://developer.nvidia.com/blog/nvidia-transitions-fully-t...

Yet to personally try it out, but this should eventually enable better integration with the library ecosystems. =3

I have a similar CPU, and I also get frequent iGPU crashes, but only when opening multiple tabs (6+) with video.

I also increased UMA to 4 GB, it reduced the crash frequency, but it still happens.

The discrete NVIDIA GPU I use at the same time is fine.
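
For anyone on Linux who wants to confirm that the UMA change actually took effect, here is a rough sketch. It assumes the iGPU is driven by amdgpu, which exposes the carve-out size in bytes through the mem_info_vram_total sysfs file; the card index differs between machines, and bytes_to_mib is just a helper name made up for this example.

```shell
# Helper (hypothetical name): convert a byte count to whole MiB.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Assumption: amdgpu is in use. The card index (card0, card1, ...) varies,
# so loop over every card that exposes the file.
for f in /sys/class/drm/card*/device/mem_info_vram_total; do
    [ -r "$f" ] || continue
    echo "$f: $(bytes_to_mib "$(cat "$f")") MiB"
done
```

With a 4 GB UMA setting the iGPU entry should report roughly 4096 MiB; a discrete card, if present, shows up as a separate entry only if it also uses amdgpu.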

Please post the cpu-z (win) or cpu-x (linux) chip make/model for other users to compare/search.

If there is enough data here, we may be able to see a common key detail emerge. I.e. if the anecdotal problem(s) remain overtly random, then a solution from the community or OEM may prove impossible.

Thanks in advance, =3

I initially got somewhat frequent hangs on Fedora with a Radeon 680M iGPU (in a Ryzen 7 PRO 6850U APU). The hangs stopped when I added amdgpu.dcdebugmask=0x10 to kernel boot options, based on some comments in an AMD Linux driver bug report [1]. That seems to disable panel self-refresh so it would seem to be related to that somehow.

Stability has been fine since. The bug report has since been closed but I haven't tested in a while to see if disabling PSR is still needed or if the issue has actually been fixed.

I haven't seen significant stability issues on Windows, although I don't use it much on the AMD device.

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2443
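
A minimal sketch of how the boot option mentioned above could be checked and added, in case it helps. The has_kernel_arg helper is a made-up name for this example, and the grubby line assumes a stock Fedora install; other distros manage kernel arguments differently.

```shell
# Hypothetical helper: succeeds if the command line text in $1
# contains the exact parameter in $2 as a whole word.
has_kernel_arg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_kernel_arg "$(cat /proc/cmdline 2>/dev/null)" "amdgpu.dcdebugmask=0x10"; then
    echo "amdgpu.dcdebugmask=0x10 is set on the running kernel"
else
    echo "amdgpu.dcdebugmask=0x10 is not set"
    # On Fedora, grubby can add it for all installed kernels (needs root and a reboot):
    #   sudo grubby --update-kernel=ALL --args="amdgpu.dcdebugmask=0x10"
fi
```

To test whether the workaround is still needed, the same argument can be dropped again with grubby's --remove-args option.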

Is that Wayland or Xorg?

With PSR in the mix, is the system really hanging or is it just failing to update the screen somehow? I.e. can you tell the difference with logs or a remote connection or configure and use an unprompted shutdown via the power button?
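
One way to check via logs, as a rough sketch: after a hang and forced reboot, look at the kernel messages from the previous boot. This assumes a systemd distro with a persistent journal; amdgpu_lines is only a helper name for this example, and the filter is a plain case-insensitive match on "amdgpu", not a list of known error strings.

```shell
# Hypothetical helper: keep only lines from stdin that mention amdgpu.
amdgpu_lines() {
    grep -i 'amdgpu'
}

# -k = kernel messages only, -b -1 = the boot before the current one.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | amdgpu_lines | tail -n 20
fi
```

If the log keeps going past the moment the screen froze, the system was probably still alive and only the display stopped updating.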

Thanks for contributing.

Your tip may help some folks in the future. =)

Please pull the chip maker/model and ram details off your rig:

sudo apt-get install cpu-x

sudo cpu-x

I think comparing your specifications may help other users narrow down if a manufacturing or software defect is present.

Thanks in advance =3
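
If cpu-x isn't packaged for your distro, a rough alternative sketch using lscpu and dmidecode is below. The part_numbers helper is a name invented for this example, and it assumes dmidecode prints memory modules with a "Part Number:" field, which can be blank or generic on some laptops with soldered RAM.

```shell
# Hypothetical helper: pull the "Part Number" values out of
# dmidecode-style text on stdin.
part_numbers() {
    awk -F': ' '/Part Number:/ { print $2 }'
}

# CPU vendor and model, no root needed.
command -v lscpu >/dev/null 2>&1 && lscpu | grep -E 'Vendor ID|Model name' || true

# RAM module part numbers; dmidecode needs root, so run this by hand:
#   sudo dmidecode --type memory | part_numbers
```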

Depends on the failure mode, as it is common for specs to drift around under load (also, temperature cycling stresses PCB, and can shear BGA connections.)

I'd try a slower, cheap set of lower-bandwidth/higher-latency ram sticks to see if it stops glitching up. If you are using low latency sticks (iGPU means this is usually recommended), then dropping the performance a bit may stabilize your specific equipment.

Of course, I'm not that smart... so YMMV... =3

There are no sticks in my laptop. I was talking about soldered RAM, as is the norm on recent high speed LPDDR5X laptops.
Please pull the chip maker/model off your rig:

sudo apt-get install cpu-x

sudo cpu-x

We may still be able to use this information to compare with other users' glitches to see if there is some underlying similarity.

Unfortunately, if it is a thermal stress/warping on the PCB cracking open RAM BGA balls on chips or shifting traces... One won't really be able to completely identify the intermittent issue.

We were actually looking at buying a similar economy model earlier this year (ended up with a few classic Lenovo models instead)... so please be verbose with the make/model to help future searchers =3

Can't be thermal, I checked.
X-ray vision like Superman I gather... nice... ;)

Please dump the problematic cpu/ram chip model numbers to help other users. These chip manufacturer numbers are not really personally identifiable information, as they are shared between hundreds of thousands of products.

The classic cpu-z for Windows users is here if you don't run *nix:

https://www.cpuid.com/softwares/cpu-z.html

Best regards, =3

I did a ~12h RAM test a few times and it always passed successfully (except when I was testing the EXPO profile on an early BIOS version).

I also did Prime95 CPU stress testing a few times without issues.

All issues seem to be related to either BIOS or drivers.

Please join the branch discussing the idea of using slower/cheaper RAM.

What is your current ram chip model, maker, and configuration on your machine?

sudo apt-get install cpu-x

sudo cpu-x

Cheers, =3

Corsair Vengeance 64GB (2x32GB) 5600MHz C36. Module Part Number: CMH64GX5M2B5600C36. DRAM manufactured by Samsung.

Running RAM at default speeds (4800MHz) or using XMP profile 5600MHz C36 doesn't affect these issues (they are no more or less frequent).

EDIT: XMP profile, not EXPO.

Thanks for helping the other users =3
Some more info if it helps anyone:

CPU Ryzen 9 7950X. Family: F (ext.: 19), Model: 1 (ext.: 61), Stepping: 2, Revision: RPL-B2.

iGPU: Raphael, revision: C1.

MB: ASUS TUF Gaming X670E-PLUS WiFi. Rev 1.xx. Southbridge rev.: 51.