by jaberjaber23 262 days ago

Testing CUDA kernels on 15 GPUs costs thousands every month

we couldn’t afford that, so we built an emulator that predicts how your kernel would run on GPUs like the H100, A100, RTX 4090, or V100 without executing a single line on the hardware

it’s not a guess, it gives real numbers

for example, 2.4 ms on an RTX 4090 and 5.1 ms on a V100, both within 1% of real hardware

how it works

- NeuSight (99% accuracy): splits the kernel into tiles and simulates each one using real GPU specs (e.g. 132 SMs on the H100, 10 on the GTX 1060), checking occupancy, bandwidth, and wave scheduling

- NCU Baseline (95–98%): if you've profiled the kernel once with Nsight Compute (NCU), we scale that measurement across GPUs using per-architecture factors (Hopper is 1.05x Ada, Ampere is 0.92x), all measured manually

- Analytical (85–92%): roofline-model fallback, works even without source code
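
to make the three ideas above concrete, here's a rough sketch in Python. this is not the emulator's actual code: the function names are made up for illustration, the peak FLOP/s and bandwidth values in the usage lines are round placeholder numbers rather than real GPU specs, and reading the architecture factors as relative throughput with Ada = 1.0 (so predicted time divides by the factor) is an assumption. only the SM counts (132 and 10) come from the post.

```python
import math


def wave_count(num_tiles: int, num_sms: int, blocks_per_sm: int) -> int:
    """Waves needed to run every tile (thread block) on the GPU.

    A wave is one batch of blocks that fits on the device at once:
    num_sms * blocks_per_sm, where blocks_per_sm is the occupancy limit.
    """
    concurrent_blocks = num_sms * blocks_per_sm
    return math.ceil(num_tiles / concurrent_blocks)


def roofline_time_ms(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bw: float) -> float:
    """Roofline estimate: the kernel is bound by compute or by memory,
    whichever takes longer. Units: FLOP, bytes, FLOP/s, bytes/s."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / peak_bw
    return max(compute_s, memory_s) * 1e3


def scale_baseline_ms(measured_ms: float, relative_throughput: float) -> float:
    """Scale one profiled time to another architecture.

    ASSUMPTION: the factor is relative throughput versus the profiled
    architecture, so a 1.05x factor means the kernel finishes sooner.
    """
    return measured_ms / relative_throughput


if __name__ == "__main__":
    # Same 1000-tile kernel, 2 resident blocks per SM (hypothetical occupancy).
    print("waves on 132 SMs:", wave_count(1000, 132, 2))
    print("waves on 10 SMs:", wave_count(1000, 10, 2))

    # Placeholder workload and peaks, NOT real hardware numbers.
    print("roofline ms:", roofline_time_ms(1e9, 1e9, 1e12, 1e11))

    # Hypothetical 2.1 ms profile on Ada, scaled with the 1.05x Hopper factor.
    print("scaled ms:", scale_baseline_ms(2.1, 1.05))
```

the wave count is where SM count really bites: the last wave is usually only partly full, so time does not shrink smoothly as you add SMs. a real tile simulator would also have to model per-tile latency, shared memory and register limits on occupancy, and cache behavior, none of which this sketch covers.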

we validated on 47 kernels across 12 GPUs

accuracy stayed above 98%, occupancy predictions were almost perfect

one team saved $18k in GPU cloud time

another found bugs on an A100 they didn’t own

still missing: dynamic parallelism, multi-GPU support, and fully accurate tensor core modeling, but we’re getting there

happy to go into the math or architecture details if anyone’s curious