Hacker News
by Joel_Mckay 490 days ago
In general, NVIDIA did not have proper, bug-free support for C for well over a decade (hidden de-allocation errors, etc.), and essentially everyone focused on the CUDA compiler with the C++ API.

To be honest, it still bothers me that an awful GPU mailbox design is still the cutting-edge tech for modern computing. GPU rootkits are already a thing... Best of luck =3

2 comments

"GPU rootkits" sounds like a misnomer unless GPUs start getting rewritable persistent storage (maybe they do now and my knowledge is out of date).

If you've got malicious code in your GPU, shut it off, wait a few seconds, and turn it back on.

Actually looking at the definition, my understanding might be off or the definition has morphed over time. I used to think it wasn't a rootkit unless it survived reinstalling the OS.

These have direct access to the DMA channel of your storage device, and PoCs have shown that MMU/CPU bypass is feasible.

My point was the current architecture is a kludge built on a kludge... =3

> with the C++ API

The funny thing is that the "C++ API" is almost entirely C-like, forgoing almost everything beneficial and convenient about C++, while at the same time not being properly limited to C.

(which is why I wrote this: https://github.com/eyalroz/cuda-api-wrappers/ )
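To make the "C-like vs. actual C++" contrast concrete, here is a minimal sketch. It is not the API of cuda-api-wrappers or of CUDA itself: `fake_malloc` and `fake_free` are hypothetical stand-ins (backed by host `malloc`) for status-code-returning calls in the style of `cudaMalloc`/`cudaFree`, so the snippet builds without a GPU toolchain. The `buffer` class and `leaked_after_demo` are likewise made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins for a C-style allocation API (in the spirit of
// cudaMalloc/cudaFree). They use host malloc so this compiles without CUDA.
using status_t = int;             // 0 means success, as in many C APIs
static int live_allocations = 0;  // bookkeeping for this demo only

status_t fake_malloc(void** ptr, std::size_t bytes) {
    *ptr = std::malloc(bytes);
    if (*ptr == nullptr) return 1;
    ++live_allocations;
    return 0;
}

status_t fake_free(void* ptr) {
    if (ptr != nullptr) {
        std::free(ptr);
        --live_allocations;
    }
    return 0;
}

// C-like usage: every call site checks a status code and must remember to free.
status_t c_style_use(std::size_t bytes) {
    void* p = nullptr;
    status_t s = fake_malloc(&p, bytes);
    if (s != 0) return s;
    // ... work with p ...
    return fake_free(p);
}

// C++-style usage: RAII owns the buffer and failures become exceptions.
class buffer {
public:
    explicit buffer(std::size_t bytes) : size_(bytes) {
        if (fake_malloc(&ptr_, bytes) != 0) {
            throw std::runtime_error("allocation failed");
        }
    }
    ~buffer() { fake_free(ptr_); }

    buffer(const buffer&) = delete;
    buffer& operator=(const buffer&) = delete;
    buffer(buffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}
    buffer& operator=(buffer&&) = delete;

    void* data() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return size_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
};

// Returns the number of allocations still alive after both styles have run.
int leaked_after_demo() {
    {
        buffer a(256);
        buffer b(std::move(a));  // ownership moves; a no longer frees anything
        assert(live_allocations == 1);
    }                            // b's destructor releases the buffer here
    c_style_use(64);
    return live_allocations;
}
```

Calling `leaked_after_demo()` should leave no live allocations. The point is only that scope-bound ownership and exceptions remove the per-call-site bookkeeping that a C-like interface pushes onto the caller.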

> an awful GPU mailbox design is still the cutting-edge tech

Can you elaborate on what you mean by a "mailbox design"?

Depends on which CUDA API one is looking at:

https://docs.nvidia.com/cuda/cuda-c-std/index.html

I meant the fundamental ones, mostly:

* CUDA Driver API: https://docs.nvidia.com/cuda/cuda-driver-api/index.html
* NVRTC: https://docs.nvidia.com/cuda/nvrtc/index.html
* (CUDA Runtime API, very popular but not entirely fundamental as it rests on the driver API)

The CUDA C++ library is a behemoth that sits on top of other things.

In general, a modern GPU must copy its workload into and out of its own working area in VRAM regardless of the compute capability number, and thus is constrained by the same clock-domain-crossing performance bottleneck many times per transfer.
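A rough way to picture the copy-in/compute/copy-out cost is a back-of-envelope model in which each transfer pays a fixed overhead plus a bandwidth-proportional term. This is only a sketch: the `link_model` struct, its fields, and the function names are hypothetical, and any numbers fed to it are illustrative assumptions rather than measurements of a real GPU or bus.

```cpp
#include <cassert>
#include <cstddef>

// Assumed, illustrative link parameters; not measurements of real hardware.
struct link_model {
    double fixed_overhead_us;  // per-transfer setup cost, in microseconds
    double bytes_per_us;       // sustained bandwidth, in bytes per microsecond
};

// Time for one host<->device copy of `bytes` under the model.
double copy_time_us(const link_model& link, std::size_t bytes) {
    assert(link.bytes_per_us > 0.0);
    return link.fixed_overhead_us + static_cast<double>(bytes) / link.bytes_per_us;
}

// Round trip: copy the input in, run the kernel, copy the output back out.
double round_trip_us(const link_model& link, std::size_t in_bytes,
                     std::size_t out_bytes, double kernel_us) {
    return copy_time_us(link, in_bytes) + kernel_us + copy_time_us(link, out_bytes);
}

// Fraction of the round trip spent moving data rather than computing.
double transfer_fraction(const link_model& link, std::size_t in_bytes,
                         std::size_t out_bytes, double kernel_us) {
    const double total = round_trip_us(link, in_bytes, out_bytes, kernel_us);
    return (total - kernel_us) / total;
}
```

With an assumed 10 us overhead and 1000 bytes/us of bandwidth, a kernel that runs for 22 us on 1000 bytes in and 1000 bytes out spends half of its round trip on transfers under this model. Shrinking the payload does not help much, because the fixed per-transfer term stays.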

At least the C++ part of the system was functional enough to build the current house of cards. Best of luck =3