by Rinzler89 705 days ago
>Sounds like bad ram (clean contacts, re-seat, and test)

Since he's talking about iGPU issues, he most likely has a laptop APU, so no RAM to reseat. I'm also having similar issues on my Ryzen 7000 laptop. Kinda regret upgrading from the Ryzen 5000 laptop, which AMD obsoleted just 2 years after I bought it, as at least that had no issues. Hopefully new drivers in the future will fix stability, but you never know.

What I do know is that this will most likely be my last AMD machine if Intel shows improvement to match AMD, since their Linux driver support is just top notch.


Desktop Ryzen 7950X.

Increasing the VRAM size (UMA size) to 4 GB fixed the frequent driver timeouts for me.

Reverting to an older driver (driver cleaner -> driver v23.11.1) fixed the memory leak. This memory leak is weird since PoolMon doesn't show anything unusual. Nothing shows as using too much memory anywhere, except that committed memory size grows to over 100 GB after a few days of uptime and RamMap shows a large amount of unused-active memory.

GPUs have the most complex drivers in the whole system, we're talking tens of millions of lines of code, so it is absolutely not surprising that you're having issues like that given how recent AMD's investment into APUs is. I wouldn't use them for a few more years; get a cheap discrete GPU from NVIDIA or maybe even from Intel.
Hm? AMD's investing in APUs is not a new thing, that's going back to the FX days with their FM1 socket. Since Ryzen 1 they have their G APUs, and their integrated graphics power the Steam Deck and many other mobile handhelds. Plus, Intel's integrated graphics are known for their driver issues (and so is Arc, for now), so I'd disagree with that recommendation.
APU is not only not a new thing, it’s a marketing term AMD themselves invented over 10 years ago pushing the entire concept of having an iGPU.
The RTX 3090 is an Ampere GPU, and will apparently be supported in the new open NVIDIA driver release.

Should get interesting soon =)

In Nova? Or just the in-kernel component?
Press release:

https://developer.nvidia.com/blog/nvidia-transitions-fully-t...

Yet to personally try it out, but this should eventually enable better integration with the library ecosystems. =3

I have a similar CPU, and I also get frequent iGPU crashes, but only when opening multiple tabs (6+) with video.

I also increased UMA to 4 GB, it reduced the crash frequency, but it still happens.

The discrete NVIDIA GPU I use at the same time is fine.

Please post the cpu-z (win) or cpu-x (linux) chip make/model for other users to compare/search.

If there is enough data here, we may be able to see a common key detail emerge. I.e. if the anecdotal problem(s) remain overtly random, then a solution from the community or OEM may prove impossible.

Thanks in advance, =3

I initially got somewhat frequent hangs on Fedora with a Radeon 680M iGPU (in a Ryzen 7 PRO 6850U APU). The hangs stopped when I added amdgpu.dcdebugmask=0x10 to kernel boot options, based on some comments in an AMD Linux driver bug report [1]. That seems to disable panel self-refresh so it would seem to be related to that somehow.

Stability has been fine since. The bug report has since been closed but I haven't tested in a while to see if disabling PSR is still needed or if the issue has actually been fixed.

I haven't seen significant stability issues on Windows, although I don't use it much on the AMD device.

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2443
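
For anyone wanting to confirm the workaround above is actually active after a reboot, here is a minimal Python sketch. It only parses a kernel command line string for `amdgpu.dcdebugmask` and tests the 0x10 bit mentioned above. That 0x10 means "PSR disabled" is taken from the comment, not verified here, and the helper names (`dcdebugmask`, `psr_bit_set`) are made up for illustration.

```python
import re

# Bit quoted in the comment above as disabling panel self-refresh (assumption).
PSR_DISABLE_BIT = 0x10


def dcdebugmask(cmdline):
    """Return the amdgpu.dcdebugmask value found in a kernel command line, or None."""
    matches = re.findall(r"(?:^|\s)amdgpu\.dcdebugmask=(\S+)", cmdline)
    if not matches:
        return None
    try:
        # Use the last occurrence; base 0 accepts both "0x10" and "16".
        return int(matches[-1], 0)
    except ValueError:
        return None


def psr_bit_set(cmdline):
    """True if the mask is present and has the 0x10 bit set."""
    mask = dcdebugmask(cmdline)
    return mask is not None and bool(mask & PSR_DISABLE_BIT)


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:  # Linux only
            current = f.read()
    except OSError:
        current = ""
    print("PSR-disable bit set:", psr_bit_set(current))
```

Run on the affected machine, it reads `/proc/cmdline`; on anything else it just reports that the bit is not set.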

Is that Wayland or Xorg?

With PSR in the mix, is the system really hanging or is it just failing to update the screen somehow? I.e. can you tell the difference with logs or a remote connection or configure and use an unprompted shutdown via the power button?
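
One rough way to tell a real hang from a frozen display is to poke the machine from a second box while the screen is stuck. A minimal sketch, assuming an SSH server is listening on the affected machine (the port 22 default and the `is_reachable` name are assumptions for illustration):

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If this returns True while the panel is frozen, the kernel and network stack are still alive and the problem is more likely on the display side. It says nothing about whether the GPU itself has recovered.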

It was on Wayland. I'm not sure if I tried with X.

I can't remember the details of it. It effectively hung in the sense that I couldn't get the system into a usable state again locally without rebooting. I'm not sure if the system responded to the power button or not, or whether there was useful log output.

I didn't bother trying with a remote connection since the hang was frequent enough that it wouldn't have been of any use as a workaround anyway. I'd guess switching to another virtual console probably didn't work because I'd probably remember it if it did.

I can try re-enabling PSR and see if the problem is still there if you're interested.

Thanks for contributing.

Your tip may help some folks in the future. =)

Please pull the chip maker/model and ram details off your rig:

sudo apt-get install cpu-x

sudo cpu-x

I think comparing your specifications may help other users narrow down if a manufacturing or software defect is present.

Thanks in advance =3

Depends on the failure mode, as it is common for specs to drift around under load (also, temperature cycling stresses PCB, and can shear BGA connections.)

I'd try a slower, cheap set of lower-bandwidth/higher-latency RAM sticks to see if it stops glitching up. If you are using low-latency sticks (iGPU means this is usually recommended), then dropping the performance a bit may stabilize your specific equipment.

Of course, I'm not that smart... so YMMV... =3

There are no sticks in my laptop. I was talking about soldered RAM, as is the norm on recent high-speed LPDDR5X laptops.
Please pull the chip maker/model off your rig:

sudo apt-get install cpu-x

sudo cpu-x

We may still be able to use this information to compare with other users' glitches to see if there is some underlying similarity.

Unfortunately, if it is a thermal stress/warping on the PCB cracking open RAM BGA balls on chips or shifting traces... One won't really be able to completely identify the intermittent issue.

We were actually looking at buying a similar economy model earlier this year (ended up with a few classic Lenovo models instead)... so please be verbose with the make/model to help future searchers =3

Can't be thermal, I checked.
X-ray vision like Superman I gather... nice... ;)

Please dump the problematic CPU/RAM chip model numbers to help other users. These chip manufacturer numbers are not really personally identifiable information, as they are shared between hundreds of thousands of products.

The classic cpu-z for Windows users is here if you don't run *nix:

https://www.cpuid.com/softwares/cpu-z.html

Best regards, =3

>X-ray vision like Superman I gather... nice... ;)

That snarkiness is uncalled for. I repasted the laptop, ran benchmarks and checked the temperature sensors, plus used my FLIR. It's not a thermal issue. It's just AMD iGPU driver bugginess.

  Processors Information
  -------------------------------------------------------------------------
  Socket 1      ID = 0
  Number of cores    8 (max 8)
  Number of threads  16 (max 16)
  Secondary bus #    0
  Number of CCDs    1
  Manufacturer    AuthenticAMD
  Name      AMD Ryzen 7 7840HS
  Codename    Phoenix
  Specification    AMD Ryzen 7 7840HS with Radeon 780M Graphics   
  Package     Socket FP7
  CPUID      F.4.1
  Extended CPUID    19.74
  Core Stepping    PHX-A1
  Technology    4 nm
  TDP Limit    54.0 Watts
  Tjmax      90.0 °C
  Core Speed    2761.5 MHz
  Multiplier x Bus Speed  27.71 x 99.6 MHz
  Base frequency (cores)  99.6 MHz
  Base frequency (mem.)  99.6 MHz
  Instructions sets  MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AES, AVX, AVX2, AVX512 (DQ, BW, VL, CD, IFMA, VBMI, VBMI2, VNNI, BITALG, VPOPCNTDQ, BF16), FMA3, SHA
  Microcode Revision  0xA704104
  L1 Data cache    8 x 32 KB (8-way, 64-byte line)
  L1 Instruction cache  8 x 32 KB (8-way, 64-byte line)
  L2 cache    8 x 1024 KB (8-way, 64-byte line)
  L3 cache    16 MB (16-way, 64-byte line)
  Preferred cores    2 (#1, #3)
  Max CPUID level    0000000Dh
  Max CPUID ext. level  80000026h
  FID/VID Control    yes
  # of P-States    3
  P-State      FID 0x898 - VID 0xBF (38.00x - 1.194 V)
  P-State      FID 0x858 - VID 0xAB (22.00x - 1.069 V)
  P-State      FID 0xA50 - VID 0x97 (16.00x - 0.944 V)
  PStateReg    0x80000000-0x49AFC898
  PStateReg    0x80000000-0x45AAC858
  PStateReg    0x80000000-0x4425CA50
  PStateReg    0x00000000-0x00000000
  PStateReg    0x00000000-0x00000000
  PStateReg    0x00000000-0x00000000
  PStateReg    0x00000000-0x00000000
  PStateReg    0x00000000-0x00000000

  Package Type    0x4
  Model      00
  String 1    0x0
  String 2    0x0
  Page      0x0
  Power Unit    0.0
  SMU Version    76.73.00
  TDP/TJMAX    0x36005A
  TCTL Offset    0x0
  PMTV      004C0008
  Package Power Tracking (PPT)    54.0 W (current)
  Package Power Limit #1 (long)     35.0 W
  Package Power Limit #2 (short)    25.0 W


  DMI Physical Memory Array  
   location  Motherboard
   usage   System Memory
   correction  None
   max capacity  64 GB
   max# of devices  4

  DMI Memory Device  
   designation  DIMM 0
   format   Row of chips
   type   LPDDR5
   total width  32 bits
   data width  32 bits
   size   8 GB
   speed   6400 MHz
   manufacturer  Micron Technology
   part number  MT62F2G32D4DS-026 WT
   serial number  00000000
   voltage   0.500000
   manufacturer id  0x2C80
   product id  0x0


  Display adapter 0 (primary) 
   ID   0x2180003
   Name   AMD Radeon(TM) 780M
   Board Manufacturer Lenovo
   Codename  Phoenix
   Cores   768
   ROP Units  16
   Technology  4 nm
   Memory size  1024 MB
   Current Link Width x16
   Current Link Speed 16.0 GT/s
   PCI device  bus 99 (0x63), device 0 (0x0), function 0 (0x0)
    Vendor ID 0x1002 (0x17AA)
    Model ID 0x15BF (0x3819)
    Revision ID 0xC7
   Root device  bus 0 (0x0), device 8 (0x8), function 1 (0x1)
   Performance Level Current
    Core clock 800.0 MHz
    Shader clock 400.0 MHz
    Memory clock 800.0 MHz
   Driver version  32.0.11021.1011
   WDDM Model  3.1