| HN Mirror

Unfortunately it has a really high "works on my machine" factor. I'm using Llama2-chat-13B via mlc-llm + whisper-streaming + coqui TTS. I just have a bunch of hardcoded paths and these projects tend to be a real pain to set up, so figuring out a nice way to package it up with its dependencies in a portable way is the hard part.

I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right thing is actually to have the local model generate the first few words (even filler words sometimes maybe) and then switch to the GPT-4 answer whenever it comes back.