| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by npsomaratna 1023 days ago

You don't need so many layers of stuff (or API keys, signups, or other nonsense).

Llama.cpp (to serve the model) + the Continue VS Code extension are enough.

The rough list of steps to do so are:

  Part A: Install llama.cpp and get it to serve the model:
  --------------------------------------------------------
  1. Install the llama.cpp repo and run make.
  2. Download the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf).
  3. Run the llama.cpp server (e.g., ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock).
  4. Run the OpenAI like API server [also included in llama.cpp] (e.g., python ./examples/server/api_like_OAI.py).

  Part B: Install Continue and connect it to llama.cpp's OpenAI like API:
  -----------------------------------------------------------------------
  5. Install the Continue extension in VS Code.
  6. In the Continue extension's sidebar, click through the tutorial and then type /config to access the configuration.
  7. In the Continue configuration, add "from continuedev.src.continuedev.libs.llm.ggml import GGML" at the top of the file.
  8. In the Continue configuration, replace lines 57 to 62 (or around) with:

    models=Models(
        default=GGML(
            max_context_length=16384,
            server_url="http://localhost:8081"
        )
    ),

  9. Restart VS Code, and enjoy!

You can access your local coding LLM through the Continue sidebar now.

6 comments

tayo42 1023 days ago

One of the most annoying things about learning ai/ml for me right now is how much of this stuff is hidden behind people's comlanies and projects with to many emojis.

Like I can't find simple straight foward solutions or content that isn't tied back to a company.

link

jfoucher 1023 days ago

I'm a complete beginner regarding this stuff, so if I may ask, how would I go about downloading the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf) I checked on Hugging face but all I got was a bunch of .bin files...

Thanks.

link

npsomaratna 1023 days ago

Do a search on the HuggingFace models page, e.g.:

https://huggingface.co/models?sort=trending&search=wizardcod...

link

jfoucher 1023 days ago

Thanks, I managed to convert what I had downloaded with the convert.py script in llama.cpp.

link

rpgwaiter 1023 days ago

Google the filename + "torrent download"

link

k4rli 1023 days ago

Thanks, works nicely and easy to set up.

Is it possible to use GPU for this? With R9 7900x and 32GB RAM it takes 15-30sec to generate response. I have a 6900XT which might be more suited for this.

link

npsomaratna 1023 days ago

Yes. In the llama.cpp server command, specify the number of layers you'd like offloaded to your GPU via the -ngl parameter, e.g.:

  ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60

(You might need to play around with the number of layers.)

[Edit: make sure to compile llama.cpp with GPU support first, e.g., "make clean && LLAMA_CUBLAS=1 make -j"]

link

redox99 1023 days ago

Is there a way to make it work with ooba+exllama? (much faster than llamacpp)

link

thelastparadise 1023 days ago

You should be able to turn on the API in booba:

https://github.com/oobabooga/text-generation-webui#api

link

redox99 1023 days ago

But that API isn't OpenAI compatible AFAIK

link

noiv 1023 days ago

Thx. Where can I send flowers to?

link

ignoramous 1022 days ago

To any person you're in a position to be kind to.

link

vanillax 1023 days ago

wodner if you can pair with https://github.com/getumbrel/llama-gpt

link