by huac
856 days ago
> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step. For these use cases, time to first byte is the most important metric, not total throughput.
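To make the time-to-first-byte point concrete, here is a minimal sketch of how one might measure it separately from total generation time on a streaming response. This is not the pipeline described above: `fake_token_stream` is a hypothetical stand-in for a streaming model, and its delay values are made up for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(
    n_tokens: int = 20, first_delay: float = 0.05, per_token: float = 0.005
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model.

    The first chunk is slow (prompt processing), later chunks are fast.
    The delays are illustrative assumptions, not measurements.
    """
    time.sleep(first_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token)
        yield f"tok{i} "


def measure_stream(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a stream and return (time_to_first_chunk, total_time, chunk_count).

    Times are in seconds, measured from just before the first chunk is requested.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            # Latency the user actually perceives before anything plays back.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream never produced a first chunk; fall back to the total.
    return (first if first is not None else total, total, count)


if __name__ == "__main__":
    ttfb, total, n = measure_stream(fake_token_stream())
    print(f"first chunk after {ttfb * 1000:.0f} ms, "
          f"{n} chunks in {total * 1000:.0f} ms total")
```

In a voice setting, the first number is what decides whether the reply feels immediate, since playback can begin as soon as the first chunk arrives while the rest is still being generated.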
The most interesting applications of LLMs are not chatbots.