| HN Mirror

thanks!!!! appreciate that a lot. i’ve mainly tested it on MNIST for now, the CUDA backend trains one epoch in ~0.4ms (batch size 1000, RTX 3060, as i mentioned in the post). It was primarily a deep dive into CUDA/C++, manual memory management, and building a dual backend architecture with a custom matrix lib (GPU-side completely from scratch). this was actually my 4th serious attempt at building a GPU-based MLP from scratch. I failed multiple times, sometimes due to a single line of code. in earlier attempts, i had this optimization idea: store both the weights and their transposes in GPU memory, so i wouldn’t have to compute the transpose each epoch. Seemed clever, until training started failing badly. Turned out I was only updating the original weights matrix after backprop, but the transposed one was still holding stale values from earlier. this broke training completely, and I spent weeks trying to debug it, couldn’t figure it out until this current version.

honestly, the biggest gotchas were-

-memory coherence issues like above (esp. when trying to cache 'smartly')

-launching kernels in the right order while keeping data in sync

-maintainingg modularity without sacrificing too much performance

i avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.