by zigzag312 705 days ago
Desktop Ryzen 7950X.

Increasing the VRAM size (UMA size) to 4 GB fixed the frequent driver timeouts for me.

Reverting to an older driver (driver cleaner -> driver v23.11.1) fixed the memory leak. This memory leak is weird since PoolMon doesn't show anything unusual. Nothing shows as using too much memory anywhere, except that committed memory grows to over 100 GB after a few days of uptime and RAMMap shows a large amount of unused-active memory.


GPUs have the most complex drivers in the whole system, we're talking tens of millions of lines of code, so it is absolutely not surprising that you're having issues like that given how recent AMD's investment into APUs is. I wouldn't use them for a few more years; get a cheap discrete GPU from NVIDIA or maybe even from Intel.
Hm? AMD investing in APUs is not a new thing; it goes back to the FX days with their FM1 socket. Since Ryzen 1 they have had their G APUs, and their integrated graphics power the Steam Deck and many other handhelds. Plus, Intel's integrated graphics are known for their driver issues (and so is Arc, for now), so I'd disagree with that recommendation.
APU is not only not a new thing, it's a marketing term AMD themselves invented over 10 years ago to push the entire concept of having an iGPU.
The RTX 3090 is an Ampere GPU, and will apparently be supported in the new open NVIDIA driver release.

Should get interesting soon =)

In Nova? Or just the in-kernel component?
Press release:

https://developer.nvidia.com/blog/nvidia-transitions-fully-t...

Yet to personally try it out, but this should eventually enable better integration with the library ecosystems. =3

I have a similar CPU, and I also get frequent iGPU crashes, but only when opening multiple tabs (6+) with video.

I also increased UMA to 4 GB, it reduced the crash frequency, but it still happens.

The discrete NVIDIA GPU I use at the same time is fine.

Please post the CPU-Z (Windows) or CPU-X (Linux) chip make/model for other users to compare/search.

If there is enough data here, we may be able to see a common key detail emerge. That is, if the anecdotal problem(s) remain overtly random, then a solution from the community or OEM may prove impossible.

Thanks in advance, =3

I initially got somewhat frequent hangs on Fedora with a Radeon 680M iGPU (in a Ryzen 7 PRO 6850U APU). The hangs stopped when I added amdgpu.dcdebugmask=0x10 to kernel boot options, based on some comments in an AMD Linux driver bug report [1]. That seems to disable panel self-refresh so it would seem to be related to that somehow.

Stability has been fine since. The bug report has since been closed but I haven't tested in a while to see if disabling PSR is still needed or if the issue has actually been fixed.

I haven't seen significant stability issues on Windows, although I don't use it much on the AMD device.

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2443
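For anyone wanting to confirm whether that boot option is actually active on a running system, here is a minimal sketch in Python. It only parses a kernel command line string; the assumption that bit 0x10 of amdgpu.dcdebugmask is the PSR-disable bit comes from the comment above, and the function names are made up for illustration.

```python
PSR_DISABLE_BIT = 0x10  # assumption: taken from the comment above, not verified here


def dcdebugmask_from_cmdline(cmdline):
    """Return the amdgpu.dcdebugmask value from a kernel command line, or None.

    If the option appears more than once, the last occurrence wins.
    Values that Python cannot parse as an integer are treated as absent.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "amdgpu.dcdebugmask" and sep:
            try:
                value = int(raw, 0)  # base 0 accepts both "16" and "0x10"
            except ValueError:
                value = None
    return value


def psr_disabled(cmdline):
    """True if the mask is present and has the assumed PSR-disable bit set."""
    mask = dcdebugmask_from_cmdline(cmdline)
    return mask is not None and bool(mask & PSR_DISABLE_BIT)


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On a Linux machine, `psr_disabled(read_cmdline())` would report whether the workaround is in effect for the current boot. It says nothing about whether the driver still needs it.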

Is that Wayland or Xorg?

With PSR in the mix, is the system really hanging or is it just failing to update the screen somehow? I.e., can you tell the difference from logs or a remote connection, or by configuring the power button to trigger a shutdown without a prompt?
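One rough way to do the remote-connection check is to probe the affected machine from a second computer while the screen is frozen. The sketch below is only an illustration: it assumes an SSH server is listening on port 22 of the affected machine, and the function name is hypothetical. A successful TCP connection suggests the kernel's network stack is still running, which points toward a display-only failure rather than a full hang, though it does not prove userspace is healthy.

```python
import socket


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Refused connections, timeouts, and name resolution failures all
    raise subclasses of OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from another machine as, for example, `tcp_port_open("192.168.1.50")`, where the address is a placeholder for the frozen system.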

It was on Wayland. I'm not sure if I tried with X.

I can't remember the details of it. It effectively hung in the sense that I couldn't get the system into a usable state again locally without rebooting. I'm not sure if the system responded to the power button or not, or whether there was useful log output.

I didn't bother trying with a remote connection since the hang was frequent enough that it wouldn't have been of any use as a workaround anyway. I'd guess switching to another virtual console probably didn't work because I'd probably remember it if it did.

I can try re-enabling PSR and see if the problem is still there if you're interested.

Looks like some of the patches discussed in that bug report work around the problem by disabling PSR-SU for the specific timing controller my display also has. Those patches are in current kernels already. So basically the problem is gone for me, even if I remove the dcdebugmask.

So, I don't really know if the system was fully hanging, or if the display was just unable to update any more, but it was likely exactly the same thing that happened to other people with Parade TCONs in that bug discussion.

Thanks for contributing.

Your tip may help some folks in the future. =)

Please pull the chip maker/model and RAM details off your rig:

sudo apt-get install cpu-x

sudo cpu-x

I think comparing your specifications may help other users narrow down if a manufacturing or software defect is present.

Thanks in advance =3