Hacker News
by ssivark 1 day ago
When doing autoregressive inference, how often do you make a CUDA kernel call? What is the main bottleneck at the throughputs you're operating at?