by rayuela
995 days ago
Can you share a GitHub link to this? Where are you reducing the latency? Are you processing the raw audio to text? In my experience ChatGPT generation time is much faster than local Llama unless you're using something potato like a 7B model.
I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right approach is actually to have the local model generate the first few words (maybe even filler words sometimes) and then switch to the GPT-4 answer whenever it comes back.
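
Roughly what I have in mind, as a minimal sketch rather than working project code: start the remote request in the background, speak whatever the local model produces in the meantime, and switch as soon as the remote answer lands. Everything here is a stand-in: `hybrid_reply`, `fake_local_words`, `fake_remote_answer` and the delays are hypothetical names and numbers I made up for illustration, and no real mlc-llm or GPT-4 calls are made.

```python
import asyncio
from typing import AsyncIterator, Awaitable, List

# Hypothetical filler words a local model might produce while waiting.
FILLER = ["Hmm,", "let", "me", "think", "about", "that", "for", "a", "second"]


async def fake_local_words() -> AsyncIterator[str]:
    """Stand-in for a local model: first word is immediate, the rest trickle out."""
    for i, word in enumerate(FILLER):
        if i > 0:
            await asyncio.sleep(0.01)  # assumed per-word latency
        yield word


async def fake_remote_answer() -> str:
    """Stand-in for a slower remote call (for example a hosted LLM API)."""
    await asyncio.sleep(0.05)  # assumed network + generation latency
    return "remote answer"


async def hybrid_reply(
    local_words: AsyncIterator[str],
    remote_call: Awaitable[str],
) -> List[str]:
    """Emit local words until the remote answer is ready, then switch to it."""
    # Kick off the remote request right away so it runs concurrently.
    remote_task = asyncio.ensure_future(remote_call)
    spoken: List[str] = []
    async for word in local_words:
        if remote_task.done():
            break  # remote answer arrived: stop using the local model
        spoken.append(word)
    # If the local words ran out first, this simply waits for the remote answer.
    spoken.append(await remote_task)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(hybrid_reply(fake_local_words(), fake_remote_answer())))
```

The awkward part this skips is making the handoff sound natural: the remote answer does not know which words were already spoken, so in practice you would probably want the local output to be generic filler, or pass the spoken prefix to the remote model.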