by everforward
769 days ago
You run most of these models in something that wraps them in an HTTP API. I use Ollama, which I think is the most popular, but I’m not in a great position to judge. My impression is that it handles running models on CPU better than the alternatives.

So you’d basically install Ollama, download one of the versions of this model from HuggingFace, create a Modelfile (since this model isn’t in the default Ollama repo), and then Ollama can answer prompts with the model. Modelfiles are very simple and modeled on Dockerfiles. It takes about 15 seconds to make one if you aren’t messing with the various parameters.

Once it’s in Ollama, just get one of the various GPT plugins for VSCode and give it the Ollama URL (http://localhost:11434 by default). I use continue.dev, but there are many. Continue takes over tab autocomplete with the LLM and has a chat window on the right, where you can use keyboard shortcuts to copy code into the prompt and ask it to edit or generate code, or ask questions about existing code.
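
For anyone who wants to see the moving parts, here is a rough Python sketch of the two pieces: the Modelfile text and a request to Ollama's HTTP API. The file name `./model.gguf`, the model name `my-model`, and the helper function names are all made up for illustration. The `/api/generate` endpoint and the `FROM` / `PARAMETER` instructions are from Ollama's docs as I understand them, so check them against your installed version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default URL, as mentioned above


def build_modelfile(gguf_path, temperature=None):
    """Return minimal Modelfile text pointing Ollama at a local GGUF file.

    gguf_path is a hypothetical path to whatever file you downloaded.
    """
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        # Optional; skip it if you aren't tuning parameters.
        lines.append(f"PARAMETER temperature {temperature}")
    return "\n".join(lines) + "\n"


def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build (but do not send) a POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, base_url=OLLAMA_URL):
    """Send the prompt to a running Ollama instance and return its reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, base_url)) as resp:
        return json.loads(resp.read())["response"]
```

Write the output of `build_modelfile("./model.gguf")` to a file named `Modelfile`, run `ollama create my-model -f Modelfile`, and after that `ask("my-model", "...")` should work as long as Ollama is running locally.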
The llama.cpp server is here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
And you can search for any GGUF on HuggingFace.
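
If you go the llama.cpp server route instead, talking to it looks roughly like this. This is a sketch under assumptions: I'm assuming the server listens on port 8080 and exposes a `/completion` endpoint that takes `prompt` and `n_predict` and returns a `content` field, which is how I remember the server README describing it. The helper names are invented here, so adjust to whatever your build actually does.

```python
import json
import urllib.request

# Assumed default address for the llama.cpp server; change it if you pass --port.
LLAMA_SERVER_URL = "http://localhost:8080"


def build_completion_request(prompt, n_predict=128, base_url=LLAMA_SERVER_URL):
    """Build (but do not send) a POST to the server's /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=128, base_url=LLAMA_SERVER_URL):
    """Send the prompt to a running llama.cpp server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict, base_url)) as resp:
        return json.loads(resp.read())["content"]
```

Same idea as with Ollama: start the server with the GGUF you downloaded, then call `complete("...")` from a script or point an editor plugin at the URL.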