by brtmr 3812 days ago
Previous discussion on the same subject, about a post written by me: https://news.ycombinator.com/item?id=9245980

This issue is not restricted to NVidia GPUs or specific operating systems: it can be reproduced on Windows, Linux and OSX. Basically, the concept of memory safety does not exist in the GPU space, which is the reason why the WebGL standard is so strict about always zeroing buffers. The issue of breaking privacy and privilege boundaries on a multiuser system is very real, and there is no workable solution. This seems to be one of those problems where a lot of people are aware, but no one is sure how to fix it and so it just stays how it is.

8 comments

This is an issue I raised with colleagues back in the late 90's. At the time only high-end graphics systems (SGI) had hardware graphics contexts. So one OpenGL app could crash or scribble all over another OpenGL application's canvas.

The truth is a GPU is an entire second computer attached via PCIE bus. As far as security is concerned this will continue to be a shit-show until we accept that fact and act accordingly.

How do we "act accordingly"?
By providing memory safety to the GPU in either software or hardware, or most likely some combination of both just like with standard memory now.
This isn't an issue anymore, nowadays there are MMUs which isolate shaders from memory they aren't supposed to touch.

OP's problem stems from the fact that some video buffer used by his browser hasn't been cleared after deallocation. At some later time this buffer either has been erroneously displayed instead of the game's buffer or has been allocated to the game which erroneously displayed it without filling with new content.
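
To make that scenario concrete, here is a minimal Python sketch. It is a toy model, not how any real driver manages VRAM; `ToyVram`, `leaked_bytes` and the `zero_on_alloc` switch are names made up for illustration. It recycles a freed buffer for a second "application" and shows what that application sees with and without a scrub on handoff.

```python
class ToyVram:
    """Toy model of a VRAM allocator that recycles buffers (hypothetical)."""

    def __init__(self, zero_on_alloc):
        self.zero_on_alloc = zero_on_alloc
        self.free_list = []  # recycled buffers; contents are left untouched on free

    def alloc(self, size):
        # Reuse a freed buffer of the same size if one is available.
        for i, buf in enumerate(self.free_list):
            if len(buf) == size:
                buf = self.free_list.pop(i)
                if self.zero_on_alloc:
                    buf[:] = bytes(size)  # scrub the previous owner's data
                return buf
        return bytearray(size)  # fresh memory starts zeroed in this toy model

    def free(self, buf):
        self.free_list.append(buf)


def leaked_bytes(zero_on_alloc):
    vram = ToyVram(zero_on_alloc)
    browser = vram.alloc(16)
    browser[:] = b"private tab data"  # exactly 16 bytes
    vram.free(browser)
    game = vram.alloc(16)  # the game displays this without filling it first
    return bytes(game)
```

With `zero_on_alloc=False` the "game" gets the browser's old bytes back; with `zero_on_alloc=True` it gets zeros.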

Perhaps you might be able to use the APIs designed for implementing DRM. There is a flag called D3D11_RESOURCE_MISC_HW_PROTECTED when creating a surface.
> there is no workable solution

Forgive my ignorance (not a graphics programmer), but why can't the drivers simply clear the buffer before handing it off to another application?

Pretty sure it has to do with benchmarks and the cut-throat competitive environment GPU-manufacturers exist in where you cut all corners to proclaim "We're the fastest!"

Not zeroing a buffer cuts a big constant out of overhead. If you know which of the benchmarks will fail if you don't zero the buffer, you code in an "exception" so the benchmark doesn't fail and other applications act wonky. This isn't the "first time" nVidia has been caught doing this, see:

http://www.geek.com/games/is-nvidia-cheating-on-benchmarks-5...

Apparently, AMD also partakes in benchmark-specific ''optimizations''[1]. Transparency is why many of us push for open source drivers.

http://www.cdrinfo.com/Sections/News/Details.aspx?NewsId=288...

Both AMD and NVidia drivers have special code paths for different applications. I don't think it's anything sinister since these are mostly fixes for the game bugs and the rest are resolutions for API ambiguities.

To give an example, consider the difference between memcpy() and memmove(). On most systems memcpy() is the same as memmove() in the sense that it works even when the source and destination overlap. Then you decide to optimize memcpy, and to prevent bugs like this https://bugzilla.redhat.com/show_bug.cgi?id=638477 you will need to set a flag USE_MEMMOVE_INSTEAD_MEMCPY for every app that you know calls memcpy on overlapping regions. You could call this "cheating", or you could be a reasonable person and say something like this https://bugzilla.redhat.com/show_bug.cgi?id=638477#c129 instead.
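
For readers who have not run into the overlap problem, here is a small Python sketch of the idea. It only simulates the two behaviours on a `bytearray` and is not the libc implementation; `forward_copy` and `overlap_safe_copy` are illustrative names. It shows one way a copy that assumes non-overlapping regions can corrupt data, while a memmove-style copy that picks its direction does not.

```python
def forward_copy(buf, dst, src, n):
    """Naive front-to-back copy; it assumes the regions do not overlap."""
    for i in range(n):
        buf[dst + i] = buf[src + i]


def overlap_safe_copy(buf, dst, src, n):
    """memmove-style copy: choose the direction so each source byte is
    read before it can be overwritten."""
    if dst <= src:
        for i in range(n):
            buf[dst + i] = buf[src + i]
    else:
        for i in reversed(range(n)):
            buf[dst + i] = buf[src + i]


a = bytearray(b"abcdef")
forward_copy(a, 2, 0, 4)        # destination starts inside the source

b = bytearray(b"abcdef")
overlap_safe_copy(b, 2, 0, 4)   # same call, direction chosen safely
```

Here `a` ends up as `ababab` (the copy re-reads bytes it has already overwritten), while `b` ends up as `ababcd`, which is what the caller presumably wanted.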

As for the original question: I am not an expert on the Windows driver model, but I have written some GPU drivers and can tell that a) memory release is asynchronous, i.e. you cannot reuse the memory until the GPU finishes using it, and b) clearing graphics memory from the CPU over PCIe is slow, and drivers, in general, do not program the GPU on their own. Taking these into account, it seems the driver is not well positioned to do this and this is a task for the OS instead.
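
Point (a) can be pictured as a deferred-free queue. The following is only a sketch under that assumption, with a made-up `DeferredFreeQueue` class and integer "fence" values standing in for whatever completion signal real hardware provides. A released buffer sits in a pending list until the GPU is known to be past the work that used it, and only then could anyone (driver or OS) safely scrub and reuse it.

```python
from collections import deque


class DeferredFreeQueue:
    """Toy model: a freed buffer is reusable only after the GPU passes its fence."""

    def __init__(self):
        self.pending = deque()  # (fence_value, buffer) in submission order
        self.reusable = []

    def release(self, buf, fence_value):
        # The app is done with buf, but GPU work up to fence_value may still touch it.
        self.pending.append((fence_value, buf))

    def retire(self, completed_fence):
        # Called when the GPU reports progress; only now is clearing and reuse safe.
        while self.pending and self.pending[0][0] <= completed_fence:
            _, buf = self.pending.popleft()
            buf[:] = bytes(len(buf))  # the point where a scrub could happen
            self.reusable.append(buf)
```

The sketch says nothing about who should own `retire`; it only shows that the clear cannot happen at the moment the application calls free.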

"I don't think it's anything sinister since these are mostly fixes for the game bugs and the rest are resolutions for API ambiguities." I think it is a big problem. It is the same as forcing Intel to change their CPU to work around bugs in your application.
This analogy is pretty spot on, actually. It's a result of a long process of software evolution that went awry and this is a big reason why we need new APIs like Mantle, Vulkan and DirectX 12.

See this fascinating post:

http://www.gamedev.net/topic/666419-what-are-your-opinions-o...

You mean something like this https://en.wikipedia.org/wiki/A20_line ?
This probably won't happen, but it seems that games programmers are a large cause of problems for driver writers. Having to work around bugs in games is bad for everyone.

Games studios, IMO, should be made to fix their bugs themselves. They all have patching mechanisms these days, so it's not like it's impossible, or even infeasible.

Reducing these problems at the API level is currently being worked on with DX12 and Vulkan. The point is to remove a huge chunk of the abstraction provided by DX/OpenGL, thus forcing the dev to write more sensible code.

Currently the engine developer in graphics programming writes something and in reality has no way of knowing what actually happens on the hardware (the API is just too high level to be able to really know much). From there it is the hardware provider's job to take out their own debugging tools and make sure correct things happen by having a custom code path in the driver.

It's a bit of the opposite, actually. There was a great article posted here (titled "Why I'm excited for Vulkan") where they explain how proprietary "tricks" GPU vendors use account for much of the necessity for game specific driver updates and optimizations. Game patches are to game bugs what driver updates (or "game profiles") are to what?

Lower level APIs like DX12 and Vulkan remove the competitive advantage vendor dependent performance creates, so well-coded games can perform consistently with lower overhead across ranges of hardware without having to rely on vendors to patch in the shortcuts through their drivers.

Currently, it's like filming a movie with IMAX specifications, then finding out that at different cinema chains it played with quality aberrations because their projectors didn't truly follow IMAX spec. The chains can fix it, but you're already getting blamed for the movie's issues. However, for a little money, on your next film they offer to work closely with you to ensure it shows the way you intended in their theaters. And no, they can't just tell you how to fix it-- their projection technology is a trade secret.

Eh, given the constraints as you spell them out, it seems a clear at the moment of release, driven by a shader, could work.
This is probably because my explanation is very brief. I don't see how a shader (a program running on the GPU) can detect that the OS has killed a process and initiate a clear.
I don't get it, though. How can the reason be due to benchmarks / performance seeking?

The driver simply has to zero the buffer when the new OpenGL / graphics context is established. It's once per application establishing a context, not per-frame (the application is responsible for per-frame buffer clearing and the associated costs). At worst this would lengthen the amount of time a GPU-using application takes to start up and open new viewports, but that hardly seems like it would matter or even register on any benchmarks.

The thing is, it's probably not once per application. I'd imagine using multiple frame buffers in an application is actually quite common and could change quite often while an application is running, especially in complex applications like games. It's probably not enough of a hit to really justify not clearing the buffer, but it's enough to make it noticeable in the benchmark race.
Across an entire computer system? For all applications? Even games? Sharing data across process boundaries is undesirable, but is something most computer users would accept if the alternative was reduced performance.

Why not just fix this in the browser? The real issue here is that this data isn't just being shared across processes but potentially with websites through malicious webgl.

If you can waste the time on allocating a buffer, you can waste the time on zeroing it. If you're in a hot loop you shouldn't be allocating giant chunks of memory.
It would be interesting to know how the cost of allocating a new fbo would compare to the cost of zeroing it out. My guess is that the cost of getting into the kernel to do the allocation in the first place would dominate, but by how much would be something neat to measure.
If it's costing a lot of time to clear a buffer, doesn't that tend to indicate that's something the video card manufacturer should design an enhancement or fix for?
> This seems to be one of those problems where a lot of people are aware, but no one is sure how to fix it and so it just stays how it is.

Well, MMUs on GPUs have been standard for a while. They just need to use them properly and at least have an opt-in mechanism to enforce zeroing of newly allocated pages.

If your mechanism is to be opt-in then it should clear deallocated pages :)
Indeed, you'd think each "app" (set of interacting processes; container; whatever) could have its own virtual GPU. But how would a compositing window manager work in that setup?
That _already_ requires sharing ownership of those GPU mappings, and you need to keep track of ownership at that level, or how else will the compositor know that the app is done drawing its window? If you don't wait, you can end up with tearing or even garbage in windows.
The same way shared memory works on general purpose computers.

Your porn browser declares it wants to share its window buffer with the WM, the kernel maps the buffer into WM's GPU address space and now WM's shaders can read from this buffer.

Indeed, depending on the drivers used, the VRAM contents can even survive across reboots. I saw this with some drivers on Linux some years ago, not sure if they still don't clear VRAM on boot.
It is still possible to inspect previous framebuffers after rebooting.
Oddly enough, this used to be a feature—in the system architectures of 30-40 years ago.

A lot of consoles had no concept of the CPU being the final arbiter of the system being running or halted—things like the PPU and SPU and so forth would just continue to merrily loop on even if the CPU halted, or was executing a CPU-reboot instruction.

You can see this in many NES and SNES games, where the game will "soft-lock": the CPU crashes, but the music (being a program running on the SPU) keeps playing, and the animations on the screen (being programs running on a PPU, or dedicated mappers feeding it) keep animating.

But this isolation can also be used deliberately, especially where the framebuffer is concerned. Since systems up until the 1990s-or-so had extremely small address-spaces ("8-bit" and "16-bit" are scary terms when you're trying to write a complex program), and since console games were effectively monolithic unikernels (even the ones "running on" OSes like DOS: DOS would basically kexec() the game), frequently a game's ROM or size-on-disk would exceed the capacity of the address space to represent it.

The solution to this was frequently to actually have several switchable ROM banks or several on-disk binaries, and to effectively transparently restart the system to switch between them. This isn't equivalent to anything as "soft" as kexec(); you wanted the CPU's state reset and main memory cleared, so your newly-loaded module could immediately begin to use it. Any state you wanted to preserve between these restarts would be stored on disk, or in battery-backed-RAM on a cartridge.

This is how C64 games managed to fit a rich-looking splash screen into their game: the splash screen was one program, and the game was another, and the splash screen would stay on the framebuffer while the C64 was rebooting into the game.

This is also the architecture of games like Final Fantasy 6 and 7: when the credits list developers' roles as something like "menu program", "battle program", or "overworld program", those aren't mistranslations of "programming"—those were literally separate programs that the console rebooted between, hopefully finishing in the time it took for the console to finish executing a fade-out on the PPU. When a battle starts in a Final Fantasy game, the CPU has been reset and main memory has been entirely cleared; everything the game knows about to run the battle is coming from SRAM. (And the reason the Chrono Trigger PSX port feels so laggy is that CT has this architecture too, but reading a binary from a CD takes a lot longer than switching a ROM bank. Games designed in the PSX era took that into consideration, but ports generally didn't.)

I've always thought it'd be cool to re-introduce this idea to game development, through a kind of abstract machine in the same sense as Löve or Haxe. You'd have a thin "graphical terminal emulator" that would contain PPU and SPU units and the SRAM, and would be controlled through an exposed pipe/socket (sort of like a richer kind of X server); and then you'd write a series of small programs that interact with that socket, none of them keeping any persistent state (only what they can read out of the viewer's SRAM), all of them passing control using exec().
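
A rough sketch of what that could look like, purely as an illustration: the three "programs" and the `run` loop below are invented, and a plain dict stands in for the viewer's SRAM. Each program keeps no state of its own, reads and writes only the shared SRAM, and names the program to hand control to, which stands in for exec().

```python
def title_program(sram):
    sram["party_hp"] = 100
    return "overworld"


def overworld_program(sram):
    sram["steps"] = sram.get("steps", 0) + 1
    return "battle"


def battle_program(sram):
    sram["party_hp"] -= 30
    return None  # nothing to hand off to; stop the machine


PROGRAMS = {
    "title": title_program,
    "overworld": overworld_program,
    "battle": battle_program,
}


def run(start):
    sram = {}   # the only state that survives a hand-off
    trace = []
    name = start
    while name is not None:
        trace.append(name)
        name = PROGRAMS[name](sram)  # stands in for exec(): the old program is gone
    return trace, sram
```

Calling `run("title")` walks title, overworld, battle, and the battle program only knows the party's HP because the title program left it in SRAM.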

(There's another thing you'd get from that, too: the ability to write strictly-single-threaded, "blocking" programs that nevertheless seemed not to block anything important like frame-rendering or music playing. You know how "Pause" screens worked in most games? They just threw up an overlay onto the PPU and then stuck the CPU into a busy-wait loop looking for an 'unpause' input. The game's logic wouldn't continue, but the game would still "be there", which was just perfect. This also allowed for "synchronous" animations—like Kirby's transformations in Kirby Super Star, or summoning spells in the FF games, or finishers in fighting games—to just run as a bit of blocking code on top of whatever was currently on the screen, without worrying that something would change state out from under them.)

The C64 showed a similar effect to the one described in the original post. To do a full reset you had to turn it off, wait about 5 seconds and turn it on again. If you did not wait, the RAM was at least partly preserved. The C64 had no dedicated video RAM; the VIC just read from regular RAM. So if you did the power cycle quickly enough, the screen was preserved.

Another effect of the screen RAM being regular RAM was that you actually could run programs in it while it was being displayed. You could watch a program run in the literal sense. This was often used by unpackers. The unpacker ran in the screen RAM, filling the rest of the RAM. After its job was done the game started and filled the screen RAM with graphics, destroying the unpacker.

EDIT: Found a video showing activity in the screen RAM of a C64. I'm not sure if this is the unpacker or this is really code execution, but it looks similar to how I remember it.

https://www.youtube.com/watch?v=5nDzFsCEZT8&feature=youtu.be...

That's neat. But did that only work then with games that had writable storage media in the cartridge? I know that was rare on NES games.

Or was there secondary memory besides the RAM and disk that allowed for data to be passed between resets?

Yes, there was battery-backed SRAM, and then there was regular SRAM. Regular SRAM was volatile: it would be guaranteed to survive the CPU reboot instruction, but wouldn't survive poweroffs. Only some consoles had it; these were usually the same ones with a really slow bus speed for the battery-backed SRAM. (And other consoles, like the C64, actually didn't clear main memory on CPU reboot; on these consoles, you'd instead manually cycle through clearing everything except what you wanted to keep, and then reboot.)

In later consoles that had their own MMUs, like the PSX, this wasn't a full hardware feature anymore, but rather a simulated convention. You'd "reboot" by dropping all your virtual-memory mappings except one, then asking the disk to async-fill some buffers from the binary you wanted to launch and then mapping those pages and jumping into them on completion. (Basically like unloading one DLL and then loading a different one, except you're also forcefully dropping all the heap allocations the old DLL made when you unload it.)

In both the hard and soft implementations, the "volatile SRAM" page could be thought of as basically a writeback cache for the state in the actual battery-backed SRAM. You wouldn't want to do individual byte-level writes to SRAM (writes to SRAM were slowwww), so when the game booted, you'd mirror SRAM to your state-page, and then update the state-page whenever you had something you wanted to persist—finally dumping it out to battery-backed SRAM when the player hit "Save". Basically, most games were "auto-saving" from the beginning—but they were auto-saving to volatile memory.
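
The write-back arrangement described above can be sketched in a few lines of Python. This is a toy model only: `SaveCache` and its `slow_writes` counter are made-up names, and a `bytearray` stands in for the battery-backed SRAM. State is mirrored once at boot, updated in the fast volatile page, and flushed in one go on "Save".

```python
class SaveCache:
    """Toy write-back cache: a fast volatile page in front of slow persistent SRAM."""

    def __init__(self, sram):
        self.sram = sram             # slow, battery-backed (a bytearray here)
        self.page = bytearray(sram)  # mirrored once at boot
        self.slow_writes = 0

    def update(self, offset, value):
        self.page[offset] = value    # cheap: touches only the volatile page

    def save(self):
        self.sram[:] = self.page     # one bulk flush when the player hits "Save"
        self.slow_writes += 1
```

Any number of `update` calls costs no SRAM traffic; the slow write happens once per `save`, and anything not yet saved is lost if the volatile page goes away.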

But even games that had true "auto-saving", like Yoshi's Island, still kept an "SRAM buffer" like this; the write-to-SRAM event was just managed as a sequence of smaller bursts of memory-mapped IO done by modules that sat there playing music+animations and not doing any logic, like YI's "field/title card" submodule when re-entered from a loss-of-life event, or its "field/score card" submodule entered by completing a stage. If there is ever what seems to be a "pointlessly long" animation in an auto-saving game of that era near a state-transition, it's probably by design, to cover for an SRAM cache-flush. (The fact that flashy "rewarding" animations turned out to also be good game design, favored by slot-machine and casual-game designers the world over, is mostly coincidence.)

---

ETA: when you had neither kind of SRAM, but still wanted to preserve some state across a reboot, what could you do? Well, write it to video memory, of course!

On a system with a PPU, the PPU owned the VRAM; it wasn't the CPU's job to reset it, but rather the PPU's. On a system with only a framebuffer, nothing owned the framebuffer (or rather, the framebuffer was an inherited abstraction from the character buffers of teletypes: the "client" owned the framebuffer, so it was up to the "client" to erase it. Restarting a mainframe shouldn't forcibly clear all its connected teletypes; disconnecting from an SSH session shouldn't forcibly clear your terminal, but rather optionally clear your terminal as a way your TTY driver is set to respond to the signal; etc.)

Either way, if you could get your data into VRAM generally or the framebuffer specifically, you could very likely read it back after reboot.
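
As a toy illustration of why that works (the `ToyMachine` class is invented, and real hardware obviously differs in the details), the reset path below wipes main RAM and simply never touches video memory:

```python
class ToyMachine:
    """Toy model: a reset clears main RAM but leaves video memory alone."""

    def __init__(self):
        self.ram = bytearray(8)
        self.vram = bytearray(8)

    def reboot(self):
        self.ram = bytearray(len(self.ram))  # main memory is wiped
        # vram is deliberately untouched: clearing it is nobody's job here


m = ToyMachine()
m.ram[0] = 42
m.vram[0] = 42   # stash the value in video memory before rebooting
m.reboot()
survived = (m.ram[0], m.vram[0])
```

After the reboot the copy in RAM is gone and the copy in VRAM is still there, which is the same property the original post stumbled over.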

VRAM is also where many "online" development suites—the BASICs and Pascals of the time—expected you to write exception traces. Rather than trying to "break into" a debugger (i.e. cram a debugger into the same address space as your software), you'd simply have your trap-handler persist the stack-trace to VRAM, switch your tape drive over to the devtools disk, and reboot. The "monitor" would load, notice that there's a stack-trace in VRAM, parse it, read the pages it mentions from your program (now on the slave tape) and display them.

Thank you, that was incredibly interesting!
Scanning the comments, it seems broken that there's no way for an end user to clear GPU memory. A way that works across platforms, or at least across drivers, would be ideal.

If it can persist across restarts... does a shutdown do anything differently? I'm not too familiar with GPU programming, but there has to be something we can do.

Yeah, I remember reading that when it was first posted. Glad to see other people are aware, but it's disappointing nothing has been done.

The fix is pretty simple, the GPU manufacturer just needs to update their driver to zero the VRAM like an OS would with RAM.

Or the window manager should zero things out? It seems more consistent for the OS to be responsible for this than, say, the driver, which also does things like supporting OpenGL and GPGPU use cases.
You can generally allocate GPU buffers without the window manager. So to be effective this would have to be in the kernel driver. OpenGL/GPGPU is generally done in the user space driver, so it'd still be separate.
When graphics card vendors want to protect their competitive edge and not do OS integration beyond dropping horrible and opaque binary blobs every now and then, they'll also have to mop up the security issues such as this on their own.

Technically this should be an OS responsibility, but practically the vendors have made that all but impossible.

> The fix is pretty simple, the GPU manufacturer just needs to update their driver to zero the VRAM

You seem to be using OSX (judging by the screenshots).

You should be aware that OSX's GPU drivers are written by Apple (or at least they act as the gatekeeper). You need to send the bug report to them. And perhaps update the title of your post to "Apple breaks..."

I've seen this exact same behavior on OSX with an Intel GPU.

As mentioned elsewhere in this post, e.g. Windows WDDM drivers require memory to be zeroed out.

Are device drivers ever updated, much less for security issues?

Seems like an obvious way to increase the take, in any event: "Chip X has a security flaw. Chip Y does not. Buy chip Y now or Evil People will own-zor all your cash-zors."

Since we're talking about GPUs... search the NVIDIA archive [1], and look for the WHQL drivers:

    Version: 359.00 - Release Date: Thu Nov 19, 2015
    Version: 358.91 - Release Date: Mon Nov 09, 2015
    Version: 358.87 - Release Date: Wed Nov 04, 2015
    Version: 358.50 - Release Date: Wed Oct 07, 2015
[1] http://www.nvidia.com/drivers/beta
So what about the BSODs that device drivers cause? Are those not device driver issues, but OS issues, or are they unfixable?
They could be any of those.

In a sense, BSODs aren't anything special -- all a BSOD means is that some code running in kernel mode has crashed or raised some exception that went unhandled. The same thing, when it happens in a user-mode program, gets you the error dialog box 'Program has stopped working'.

So the causes of BSODs and user application crashes are the same. The reason Windows has BSODs is that it's dangerous to keep the system going when something in kernel mode crashes. Things running in kernel mode have access to everything (think - all memory) and are deemed important enough to the operation of the whole system that a crash in one of those is a significant event that's worthy of special logging and rebooting. You can't guarantee, for example, that a display driver crash hasn't corrupted other parts of memory, causing potential for data loss if the system were to continue operating.

So, back to the original point. Device-driver BSODs from the big vendors are probably rare enough in general that you should suspect a hardware problem or glitch if you suddenly see one out of the blue. Graphics drivers, given their complexity, are a bit more prone to crashing though. Also, things running on the system can interact and cause the driver to crash.

Windows has lots of infrastructure in place for making sure device drivers behave safely. There are also good facilities for figuring out exactly what caused a BSOD beyond the usually cryptic-looking error code you see on the screen.

Resplendence WhoCrashed is handy: http://www.resplendence.com/whocrashed

Though if you really want to dig deep, the tools with the Windows SDK (particularly WinDbg) can let you achieve the same thing; they are developer tools though, so targeted more to that audience.

EDIT: Just to add in answer to your original comment, big-vendor graphics drivers are VERY often updated. I'd bet they're the most often updated drivers on a system. There are myriad reasons for this, both technical and competitive. That doesn't mean that long-standing problems are necessarily fixed, but both AMD and Nvidia have very regular releases with fixes and performance improvements.

Along a similar line, in my own experience, since around Windows 2000 (not ME) it's extremely rare to see a BSOD that isn't related to either bad hardware or drivers, more often than not hardware related to a driver than the driver itself.
> Are device drivers ever updated, much less for security issues?

Yep, all the time. http://i.imgur.com/8k3ffLa.png

People who often play 3D games always update their graphics drivers :)
Open source gpu drivers on Linux clear all allocations the kernel driver hands to userspace. And where it exists, different clients are also isolated from each other through the gpu MMU. On top of that all drivers guarantee that no gpu client can escape the gpu sandbox to general system memory, and on chips where hw engineers didn't provision any useful hw support to enforce this it is done by pretty costly gpu command stream parsing in the kernel.
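
To give a feel for what "command stream parsing" means, here is a heavily simplified Python sketch. The `(opcode, address, length)` tuple format and the `validate_command_stream` function are invented for illustration and do not correspond to any real GPU's command format or any real kernel driver. The idea is just that every memory reference in a submitted batch is checked against the ranges the client owns before the batch is allowed to run.

```python
def validate_command_stream(commands, allowed_ranges):
    """Reject the batch if any command touches memory outside the client's ranges.

    commands: iterable of (opcode, address, length) tuples (hypothetical format).
    allowed_ranges: list of (start, stop) half-open address ranges owned by the client.
    """
    for opcode, address, length in commands:
        end = address + length
        inside = any(start <= address and end <= stop
                     for start, stop in allowed_ranges)
        if length <= 0 or not inside:
            return False
    return True
```

Even in this toy form the cost is visible: every command in every batch has to be walked on the CPU before the GPU sees it, which is why hardware support is the preferred route.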

You can't opt out of these security features on upstream/open source linux drivers either.

Now of course this won't insulate different tabs in chrome since chrome uses just one process for all 3d rendering. But GL_ARB_robustness guarantees plus webgl requiring that you clear textures before handing them to webpages means that should work too. On top of that webgl uses gl contexts (if available), and on most hw/driver combos that support gpu MMUs even different gl contexts from the same process are isolated.

This really is a big problem with binary drivers, and has been known for years.

WebGL always seemed too low-level for reasons like this - same way Gmaps crashes my Firefox browser all the time.