Hacker News
by shahahmed 728 days ago
Arguably you could reduce latency even further by keeping the model on-device as well, but that would mean revealing the weights of the fine-tuned model.

If the user preferred reduced latency and had the RAM, is that an option?

4 comments

This is true, but only if you have a GPU (or other accelerator) comparable in performance to the one backing the service, or at least comparable after accounting for the latency saved by running locally. This is an expensive proposition because it will be sitting idle between completions and when you're not coding.
Not for this fine-tuned model yet, but Cody supports local models: https://sourcegraph.com/docs/cody/clients/install-vscode#sup....

I just used Cody with Ollama for local inference on a flight where the wifi was broken, and it never fails to blow my mind: https://x.com/sqs/status/1803269013310759236.

Looking at their GitHub page, it seems like they are using existing LLM services. It should be possible to modify Cody to make it work with a local LLM.
You don’t have to modify anything. We support local LLMs for chat and completions with Ollama.

https://sourcegraph.com/blog/local-code-completion-with-olla...
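For anyone curious what local inference looks like underneath, here is a minimal sketch of asking a local Ollama server for a completion over its HTTP API. It assumes Ollama is running on its default port (11434) and that a code model has already been pulled; the model name `codellama:7b-code` and the helper names `build_payload` and `complete` are illustrative choices, not anything Cody itself uses.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "codellama:7b-code") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def complete(prompt: str, model: str = "codellama:7b-code") -> str:
    """Send the prompt to the local server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        # With "stream": False the server replies with a single JSON object
        # whose "response" field holds the generated text.
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    print(complete("def fibonacci(n):"))
```

Nothing here leaves the machine, which is why this kind of setup keeps working without a network connection.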

My guess: the model is probably most of the "secret sauce" of Cody, so if they gave it away, people could just copy it around like MP3s.
Completely incorrect, as Sourcegraph has not historically trained models and Cody swaps between many open source and 3rd party models.