Hacker News
by badmonster 408 days ago
1000+ tokens/sec on H100s, a 5–10x speedup over typical autoregressive models, and without needing exotic hardware like Groq or Cerebras. Impressive.

Would batched inference increase throughput further, or does it already saturate the GPU's available FLOPS?
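One rough way to reason about this question is a roofline estimate. The sketch below is hypothetical. It assumes H100 SXM spec-sheet figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s HBM3 bandwidth), BF16 weights, and a forward pass dominated by weight reads, and it ignores KV-cache and activation traffic. The helper names `attainable_flops` and `decode_intensity` are made up for illustration. The key idea is that every token position that shares one read of the weights adds FLOPs without adding weight traffic, so batching raises arithmetic intensity until the compute roof is reached.

```python
# Hypothetical roofline sketch; the hardware numbers are assumed spec-sheet
# values for an H100 SXM, not measurements of any particular model.
PEAK_FLOPS = 989e12   # assumed dense BF16 peak, FLOP/s
MEM_BW = 3.35e12      # assumed HBM3 bandwidth, bytes/s


def attainable_flops(arith_intensity: float) -> float:
    """Roofline: throughput is capped by compute or by memory traffic."""
    return min(PEAK_FLOPS, MEM_BW * arith_intensity)


def decode_intensity(tokens_per_weight_read: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte for a weight-dominated forward pass.

    Each parameter is read once (bytes_per_param bytes) and used for ~2 FLOPs
    (multiply + add) per token position. Processing more positions per weight
    read multiplies the FLOPs but not the weight traffic. KV-cache and
    activation traffic are ignored.
    """
    return 2.0 * tokens_per_weight_read / bytes_per_param


# Intensity at which the roofline bends from memory-bound to compute-bound
# (about 295 FLOP/byte with the assumed numbers).
RIDGE = PEAK_FLOPS / MEM_BW

if __name__ == "__main__":
    for n in (1, 8, 64, 256, 512, 1024):
        ai = decode_intensity(n)
        frac = attainable_flops(ai) / PEAK_FLOPS
        print(f"positions per weight read={n:5d}  intensity={ai:7.1f}  fraction of peak={frac:.2f}")
```

Under these assumptions, one sequence decoded one token at a time uses well under 1% of peak, and throughput only stops improving once roughly 300 token positions share each weight read. If a model already processes many positions per forward pass (an assumption; the thread does not say how the speedup is achieved), it may already sit near the compute roof, and batching would add correspondingly less.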