by deckar01
1022 days ago
The Docker container fails while installing flash-attn… but honestly, a giant API container on top of a custom model generation framework loses all the benefits of Torch’s standard interfaces. It doesn’t really matter how optimized your model runtime is if it’s cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how humans reading the output perceive speed.
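For anyone who wants to put a number on time to first token, here is a rough sketch of how it could be measured against any streaming token iterator. The names `measure_stream` and `fake_stream` are made up for illustration, and `fake_stream` is only a sleep-based stand-in for a real model, not anything from the project under discussion.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream and return (time_to_first_token, total_time, count).

    Times are in seconds. If the stream is empty, time_to_first_token
    falls back to the total elapsed time.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            # Latency until the first decoded token arrives.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return (first if first is not None else total), total, count


def fake_stream(n: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: one up-front delay
    (prompt processing), then a fixed delay between later tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_stream(fake_stream(20, 0.2, 0.02))
    print(f"first token after {ttft:.3f}s, {count} tokens in {total:.3f}s")
```

Swapping `fake_stream` for a real streaming generator would show whether a deployment delivers its first token early or only after the whole response has been generated.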
|
The inference code is shared as a proof of concept; it is not meant to be a production-ready deployment. It's also worth noting that not all LLMs are used to produce text that is read by humans.