by Tostino 470 days ago
You are missing something. This is a single stream of inference. You can load up the Nvidia card with at least 16 concurrent inference streams and get much higher aggregate throughput in tokens/sec.

This is just a single-user chat experience benchmark.
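
To make the distinction concrete, here is a rough sketch of the arithmetic in Python. The function name, the `per_stream_retention` parameter, and the numbers in the usage line are all hypothetical placeholders, not measurements from any card. It only shows that aggregate tokens/sec across several streams is a different figure from the tokens/sec one chat user sees.

```python
def aggregate_tokens_per_sec(single_stream_tps: float,
                             streams: int,
                             per_stream_retention: float = 1.0) -> float:
    """Estimate total tokens/sec across concurrent inference streams.

    single_stream_tps: tokens/sec measured with one stream (the chat benchmark).
    streams: number of concurrent streams loaded onto the card.
    per_stream_retention: ASSUMED fraction of single-stream speed each stream
        keeps once the card is shared (1.0 = no slowdown). Hypothetical knob;
        the real value depends on the model, card, and serving stack.
    """
    if streams < 1:
        raise ValueError("streams must be at least 1")
    if not 0.0 < per_stream_retention <= 1.0:
        raise ValueError("per_stream_retention must be in (0, 1]")
    if streams == 1:
        # A single stream is exactly the single-user benchmark figure.
        return single_stream_tps
    return single_stream_tps * per_stream_retention * streams


# Placeholder numbers only: 10 tok/s single stream, 16 streams,
# each assumed to keep half its speed.
print(aggregate_tokens_per_sec(10.0, 16, 0.5))
```

Even if each stream slows down a lot under load, the total can still be several times the single-stream number, which is why a one-user benchmark undersells the card for serving workloads.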