|
|
|
|
|
by hansonw
1020 days ago
|
|
This is the best comparison I've found that benchmarks the current OSS inference solutions: https://hamel.dev/notes/llm/inference/03_inference.html IME the streaming API in text-generation-inference works fine in production. (Though some of the other solutions may be better). I've used it with Starcoder (15B) and the time-to-first-token / tokens per second all seem quite reasonable out of the box. |
|