Baseten: 592.6 tps
Groq: 784.6 tps
Cerebras: 4,245 tps
still impressive work
That said, we are serving the model at its full 131K context window, and they are serving 33K max, which could matter for edge-case prompts that run past 33K tokens.
Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.