Hacker News
by throwaway698585 1045 days ago
VirGL is a poor solution to the pressing problem of virtualized graphics. It only really exists because the hardware makers AMD/Intel/Nvidia in their infinite greed refuse to support VFIO on all GPUs the way IOMMU is supported on nearly all CPUs.
5 comments

That's 100% fair. Good thing it's not too difficult to assign VFIO w/i QEMU for virtual machines despite the manufacturer shenanigans. :)

The Arch wiki has a great guide here - https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVM...

It does get a little tricky if your GPUs are identical, but I've done this for years and maintain a guide for doing this (as well as the ACS-override patched kernel RPMs) for Fedora.

- Writeup - https://some-natalie.dev/blog/fedora-acs-override/

- Code + RPMs - https://github.com/some-natalie/fedora-acs-override

As far as concerns around stability with ACS override, I tend to only enable the override for the specific GPU (or other hardware) that I'm passing through and haven't encountered any stability problems or memory leaks that'd interrupt desktop or light server usage. I also used to run this for a bunch of white-box GPU hardware for a customer at a former job and it worked well for exploratory AI/ML workloads before investing in the big Nvidia DGX boxes. YMMV, of course!
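For anyone wondering whether they need the ACS override at all, the usual first step is to look at how the kernel has grouped devices. Here is a minimal Python sketch of that check, assuming a Linux host that exposes IOMMU groups under `/sys/kernel/iommu_groups` (the function name `iommu_groups` and the `root` parameter are my own, added so it can be pointed at a test directory):

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses in it.

    `root` is a parameter only so the function can be tested against a
    fake directory tree; on a real host the default should be right.
    """
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No groups exposed: IOMMU disabled in firmware/kernel, or not Linux.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if not devices_dir.is_dir():
            continue
        # Each entry under devices/ is named after a PCI address.
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"group {number}: {' '.join(devices)}")
```

If the GPU (and its audio function) already sit in a group of their own, the override probably isn't needed. If the group also contains unrelated devices, that is the situation the override is meant to work around.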

Is there a reason an emulator like QEMU couldn't just pass along SPIR-V like Wasm does with WebGPU? Would that be way slower?
A better way [for me, anyways] is getting GRID drivers running. However, this only works with the 9xx cards up to the 2080.

https://gitlab.com/polloloco/vgpu-proxmox

It's not difficult but it misses the point. SIOV supports thousands of VFs because that's what you need if you want a sandboxed app-per-VM security model, where statically compiled VMs are just as performant as containers but more secure.
Why is this a requirement? I thought Linux takes over after it loads.

  "Your guest GPU ROM must support UEFI"
Correct me if I am wrong on this, but to me it would seem that something like VirGL would still serve a purpose even with more widespread full SR-IOV support on consumer GPUs, as VirGL could find application in many scenarios where a GPU vendor's drivers are not compatible with the guest.

That said, I do agree that vendors should enable support in consumer GPUs, and I feel that their focus on protecting server sales is going to turn out misguided in the long term. Intel has been especially disappointing in this area, as they did allow such functionality on their GPUs in the past but have recently removed it.

AMD's mainstream CPUs supporting server features such as ECC have also proven that such restrictions aren't necessary, and that allowing this type of capability on mainstream platforms in no way harms enterprise sales.

That being said, any effort focused on GPU virtualization or drivers impresses me immensely and I very much appreciate the work done on VirGL.

>Correct me if I am wrong on this, but to me it would seem that something like VirGL would still serve a purpose even with more widespread full SR-IOV support on consumer GPUs, as VirGL could find application in many scenarios where a GPU vendor's drivers are not compatible with the guest.

Pretty much every guest OS (Windows, Linux, BSD) has drivers that would work with a native PCIe VF GPU device. macOS still has AMD drivers, but only up to RDNA 2. I can't think of any guest that would support a GL device but not have a native driver.

>That said, I do agree that vendors should enable support in consumer GPUs, and I feel that their focus on protecting server sales is going to turn out misguided in the long term. Intel has been especially disappointing in this area, as they did allow such functionality on their GPUs in the past but have recently removed it.

Intel supports SR-IOV/SIOV on consumer CPU iGPUs (Xe, 11th, 12th, and 13th gen) but not on dGPUs (A770, A750..), which is very frustrating. Indeed, 'enterprise features' such as ECC or IOMMU on consumer chips have not affected server sales.

>That being said, any effort focused on GPU virtualization or drivers impresses me immensely and I very much appreciate the work done on VirGL.

agreed

GPU Virtualization: life's toughest challenge ;)

GVT-g high-level design

https://projectacrn.github.io/1.6/developer-guides/hld/hld-A...

GVT-g was Intel's first crack at virtualization and is now abandoned. It was supplanted by Intel supporting SR-IOV, which itself was succeeded by Intel's SIOV.

https://www.intel.com/content/www/us/en/developer/articles/t...

Yeah, I would love to have api layer pass through as well as pcie layer pass through. API pass through would work well for things like containers or sandbox environments.
AMD supports VFIO on most of their cards. All of the RDNA-based cards support it, and even some pre-RDNA ones do too. With a recent-ish driver, NVIDIA's GeForce line supports it without hacks as well.

The problem isn't really HW support; it's that the software side is super glitchy, it's not all that easy to configure, and in most cases it requires a 2nd GPU if you still want basic host functions.

Whereas VirGL is much easier to get working and doesn't require specific HW support as far as I'm aware.

You are likely confusing what is better termed "PCIe passthrough" with virtual functions, wherein one physical GPU presents itself on the PCIe bus as dozens (in the case of SR-IOV) or thousands (in the case of SIOV) of GPU functions, which can be passed to dozens or thousands of GPU-enabled VMs.

https://events19.linuxfoundation.cn/wp-content/uploads/2017/...
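To make the distinction concrete: on Linux, a physical function that advertises SR-IOV exposes `sriov_totalvfs` (how many VFs it could present) and `sriov_numvfs` (how many are currently enabled) in its sysfs directory. A rough Python sketch for checking what a given machine reports, where the function name `sriov_capacity` and the `root` parameter are illustrative choices of mine:

```python
from pathlib import Path


def sriov_capacity(root="/sys/bus/pci/devices"):
    """Return {pci_address: (enabled_vfs, total_vfs)} for SR-IOV capable devices.

    Devices without both sysfs attributes are skipped, since they do not
    advertise the SR-IOV capability to the kernel.
    """
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result
    for dev in base.iterdir():
        total_file = dev / "sriov_totalvfs"
        num_file = dev / "sriov_numvfs"
        if not (total_file.is_file() and num_file.is_file()):
            continue
        # Both files hold a single decimal integer followed by a newline.
        result[dev.name] = (int(num_file.read_text()), int(total_file.read_text()))
    return result


if __name__ == "__main__":
    for address, (enabled, total) in sorted(sriov_capacity().items()):
        print(f"{address}: {enabled} of {total} VFs enabled")
```

A GPU that is only usable for whole-device passthrough simply won't show up in this output. This covers SR-IOV only; I'm not assuming SIOV is discoverable the same way.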

Unfortunately AMD cards suffer from a reset bug, still, when used with passthrough.

The reset bug being that you can pass through the card fine, once. But if you try to pass it through again (or the card experiences an issue and needs to reset), it gets caught in some kind of bad state and won’t work until power is removed and restored, which requires a reboot or an only slightly less disruptive dance with system power states.
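You can at least ask the kernel what it believes it can do for a given card. A small Python sketch, assuming the standard sysfs layout: the `reset` attribute exists when the kernel has some way to reset the device, and newer kernels also list the available methods in `reset_method` (I believe that arrived around 5.15, so treat its presence as optional). The name `reset_support` and the `root` parameter are mine:

```python
from pathlib import Path


def reset_support(address, root="/sys/bus/pci/devices"):
    """Return (has_reset, methods) for the PCI device at `address`.

    has_reset: True if the kernel exposes a `reset` attribute for the device.
    methods:   reset methods listed in `reset_method`, or [] if that
               attribute is missing (older kernels) or empty.
    """
    dev = Path(root) / address
    if not dev.is_dir():
        raise FileNotFoundError(f"no PCI device at {dev}")
    has_reset = (dev / "reset").exists()
    method_file = dev / "reset_method"
    # The file holds a space-separated list such as "flr bus".
    methods = method_file.read_text().split() if method_file.is_file() else []
    return has_reset, methods
```

For example, `reset_support("0000:03:00.0")`, with the address being whatever `lspci -D` shows for your card. Note that this only reports what the kernel thinks is available. It cannot tell you whether the card actually comes back in a usable state afterwards, which is the whole problem with the affected AMD cards.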

For Vega and 5000-series GPUs, there’s https://github.com/gnif/vendor-reset

Incidentally, Nvidia GPUs are so good at resetting, they’ve probably done so without you noticing. If the screen ever goes black for a fraction of a second and returns in normal usage, it was probably because the GPU reset itself.

The 6000-series cards below the 6800, for example, may or may not have the issue. It seems most “reference” cards are fine, but custom vendor cards often, though not always, have issues. My reference 6700 works fine, but a Sapphire 6700 probably won’t.

And the 7000 series is also fucky in a new way somehow. Gnif knows far more about this than me, and has basically thrown up his hands at how AMD doesn’t care. He’s made occasional posts about it on https://forum.level1techs.com/

Gnif is also responsible for Looking Glass: https://github.com/gnif/LookingGlass

When it comes to splitting a GPU into virtual ones with SR-IOV/MxGPU, that’s not really a thing with AMD. Whereas Nvidia will happily give you what you want if you shovel some money into their gaping maw, AMD won’t even give business customers the time of day if they aren’t worth billions. They very deliberately do not want the unwashed masses to use MxGPU. See: https://www.reddit.com/r/VFIO/comments/eqvn9z/amd_mxgpu_or_s... for a summary of the years of hopelessness on this functionality.

You could also be in a fun grey area where your 6600XT will sometimes reset properly and allow the physical host to reclaim it for its own purposes. Or not and require a full forced shutdown and reboot to restore proper function.

Let's just say I'm very aware of AMD's issues in this area :P

Also, Looking Glass is pretty great, though its use of the Desktop Duplication API seemed to carry a huge performance hit with it. Or rather it did the last time I used it (it's been a while).

Isn't it dangerous to give a guest direct access to a GPU?
It's not really too different from giving your web browser access to your GPU (and by extension to random websites using WebGL). So yes, it's dangerous. But it is at least a threat which designers of GPUs are already considering. There have, though, been interesting bugs where GPU memory hasn't been zeroed before being allocated to a new context, so you could read previously written graphics memory to find secrets.

As long as 1 GL context on the guest side == 1 GL context on the host side then it _should_ at least be as safe as letting your web browser access your GPU but certainly not as safe as using an IOMMU to segregate a whole GPU solely for your VM.

I feel like a good half of machines I find with glitchy graphics drivers seem to show bits of textures from one application inside another application - indicating memory contents leakage between contexts. Chunks of webpages from Chrome appearing in 3D games seems common.

And those are accidentally caused leaks. As soon as someone starts storing actually sensitive data in graphics memory, I'm sure lots of methods to deliberately cause leaks will be found.

I've had a more extreme case in a dual boot configuration of some graphics corruption on my Linux desktop exposing a mirrored and discolored frame of my prior Windows desktop from before rebooting.
Security isolation isn't the only possible goal of virtualization.
It's not dangerous, because it requires IOMMU support: the GPU can only access the memory space of the guest.
You mean SR-IOV?