Llama 405B up to 142 tok/s on Nvidia H200 SXM (old.reddit.com)
2 points by avianion 601 days ago
1 comment

Happy to announce this result: Llama 405B running at up to 142 tok/s, made possible largely by Nvidia's H200 SXMs and a proprietary speculative decoding algorithm.

We've launched a production-grade API endpoint at $3 per million tokens. We also have some capacity for fine-tuning 405B while keeping the speed increases, so if you're interested, please get in touch.
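
For a rough sense of what those two numbers mean for a single request, here is a back-of-the-envelope sketch. It assumes the $3 rate applies uniformly to every token billed and treats 142 tok/s as a best case (the title says "up to"), so the time figure is a lower bound, not a promise. The function names are made up for illustration and are not part of any API.

```python
# Figures taken from the post; everything else here is an illustrative assumption.
PRICE_PER_MILLION_TOKENS_USD = 3.0   # "$3 per million tokens"
PEAK_TOKENS_PER_SECOND = 142.0       # "up to 142 tok/s" (best case)


def cost_usd(tokens: int) -> float:
    """Cost of `tokens` billed tokens, assuming a flat per-token rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def min_seconds(tokens: int) -> float:
    """Lower bound on generation time if the peak rate were sustained."""
    return tokens / PEAK_TOKENS_PER_SECOND


if __name__ == "__main__":
    n = 2_000  # hypothetical response length in tokens
    print(f"{n} tokens: ${cost_usd(n):.4f}, at least {min_seconds(n):.1f} s")
```

Real latency also depends on prompt processing, batching, and network overhead, none of which are modeled here.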