by kioku 135 days ago
> Our key insight is to offload critical softmax primitives to idle tensor units, maximizing hardware utilization and throughput.

> … speedups of 1.05–1.17× across diverse attention configurations on Ampere and Hopper GPUs …