|
|
|
|
|
by CaptainOfCoit
237 days ago
|
|
Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on Day 5 of trying to debug why my GPT-OSS implementation (not using PyTorch) I've made from scratch isn't working correctly, and while I have it somewhat working with some naive and slow methods, I'm now doing an implementation of the tensor cores and have been just stuck for 2-3 days because of some small numerical difference I can't understand why it's happening. Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is... |
|
Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.
It sounds like a lot of work, but writing an optimized kernel is much slower than the numerical gradient checking and the simple kernel, and given how in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks, it only takes one bug in the whole project for the effort to pay off.