|
|
|
|
|
by sacrelege
64 days ago
|
|
We typically observe throughput of around 100–110 toks/s, and for larger context sizes this ranges between 90–100 toks/s. While we don't guarantee a fixed toks/s rate, we scale by provisioning external GPU nodes during peak demand. These nodes run our own dockerized environment over a secure tunnel. Our goal is to ensure a consistent baseline performance of at least 60–80 toks/s, even under high load. |
|