Testing CUDA kernels on 15 GPUs costs thousands every month. We couldn’t afford that, so we built an emulator that predicts how your kernel runs on any GPU (H100, A100, 4090, V100, and so on) without running it on that hardware.

It’s not a guess. It gives concrete numbers, e.g. 2.4 ms on an RTX 4090 and 5.1 ms on a V100, within 1% of real hardware.

How it works (three modes, with typical accuracy):

- NeuSight (99%): splits the kernel into tiles and simulates each one using real GPU specs, like 132 SMs on an H100 or 10 on a 1060. It checks occupancy, bandwidth, and wave scheduling.
- NCU Baseline (95–98%): if you profiled once on one GPU, we scale that profile across GPUs using per-architecture factors (Hopper is 1.05x Ada, Ampere is 0.92x), all measured manually.
- Analytical (85–92%): a roofline model fallback that works even without source code.

We validated on 47 kernels across 12 GPUs. Accuracy stayed above 98%, and occupancy predictions were almost perfect.

One team saved $18k in GPU cloud time. Another found bugs on an A100 they didn’t own.

Still missing: dynamic parallelism, multi-GPU, and fully accurate tensor core modeling, but we’re getting there.

Happy to go into the math or architecture details if anyone’s curious.
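For anyone who wants a feel for the roofline fallback and the wave scheduling check mentioned above, here is a rough Python sketch of the general idea. It is not the emulator's code: `GpuSpec`, `roofline_time_ms`, and `wave_count` are hypothetical names, and the peak throughput and bandwidth figures are approximate spec-sheet values included only as illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    name: str
    sm_count: int        # number of streaming multiprocessors
    peak_tflops: float   # peak FP32 throughput in TFLOP/s
    mem_bw_gbs: float    # peak DRAM bandwidth in GB/s


# Approximate public figures, used here as assumptions for illustration only.
GPUS = {
    "H100": GpuSpec("H100", 132, 67.0, 3350.0),
    "V100": GpuSpec("V100", 80, 15.7, 900.0),
}


def roofline_time_ms(flops: float, bytes_moved: float, gpu: GpuSpec) -> float:
    """Lower-bound runtime: the slower of the compute and memory limits."""
    compute_s = flops / (gpu.peak_tflops * 1e12)
    memory_s = bytes_moved / (gpu.mem_bw_gbs * 1e9)
    return max(compute_s, memory_s) * 1e3


def wave_count(num_blocks: int, blocks_per_sm: int, gpu: GpuSpec) -> int:
    """Number of waves needed when each SM holds blocks_per_sm resident blocks."""
    return math.ceil(num_blocks / (blocks_per_sm * gpu.sm_count))


if __name__ == "__main__":
    # Vector add over n floats: one FLOP and 12 bytes (two loads, one store) per element.
    n = 1 << 26
    for gpu in GPUS.values():
        t = roofline_time_ms(flops=n, bytes_moved=12 * n, gpu=gpu)
        waves = wave_count(num_blocks=n // 256, blocks_per_sm=8, gpu=gpu)
        print(f"{gpu.name}: >= {t:.3f} ms, {waves} waves")
```

A pure roofline bound like this ignores occupancy limits, cache behavior, and tail effects from a partially filled last wave, which is roughly why an analytical mode is the least accurate of the three.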