Hacker News new | ask | show | jobs
by npsomaratna 1023 days ago
You don't need so many layers of stuff (or API keys, signups, or other nonsense).

Llama.cpp (to serve the model) + the Continue VS Code extension are enough.

The rough list of steps to do so are:

  Part A: Install llama.cpp and get it to serve the model:
  --------------------------------------------------------
  1. Install the llama.cpp repo and run make.
  2. Download the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf).
  3. Run the llama.cpp server (e.g., ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock).
  4. Run the OpenAI like API server [also included in llama.cpp] (e.g., python ./examples/server/api_like_OAI.py).

  Part B: Install Continue and connect it to llama.cpp's OpenAI like API:
  -----------------------------------------------------------------------
  5. Install the Continue extension in VS Code.
  6. In the Continue extension's sidebar, click through the tutorial and then type /config to access the configuration.
  7. In the Continue configuration, add "from continuedev.src.continuedev.libs.llm.ggml import GGML" at the top of the file.
  8. In the Continue configuration, replace lines 57 to 62 (or around) with:

    models=Models(
        default=GGML(
            max_context_length=16384,
            server_url="http://localhost:8081"
        )
    ),

  9. Restart VS Code, and enjoy!
You can access your local coding LLM through the Continue sidebar now.
6 comments

One of the most annoying things about learning ai/ml for me right now is how much of this stuff is hidden behind people's comlanies and projects with to many emojis.

Like I can't find simple straight foward solutions or content that isn't tied back to a company.

I'm a complete beginner regarding this stuff, so if I may ask, how would I go about downloading the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf) I checked on Hugging face but all I got was a bunch of .bin files...

Thanks.

Do a search on the HuggingFace models page, e.g.:

https://huggingface.co/models?sort=trending&search=wizardcod...

Thanks, I managed to convert what I had downloaded with the convert.py script in llama.cpp.
Google the filename + "torrent download"
Thanks, works nicely and easy to set up.

Is it possible to use GPU for this? With R9 7900x and 32GB RAM it takes 15-30sec to generate response. I have a 6900XT which might be more suited for this.

Yes. In the llama.cpp server command, specify the number of layers you'd like offloaded to your GPU via the -ngl parameter, e.g.:

  ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60
(You might need to play around with the number of layers.)

[Edit: make sure to compile llama.cpp with GPU support first, e.g., "make clean && LLAMA_CUBLAS=1 make -j"]

Is there a way to make it work with ooba+exllama? (much faster than llamacpp)
You should be able to turn on the API in booba:

https://github.com/oobabooga/text-generation-webui#api

But that API isn't OpenAI compatible AFAIK
Thx. Where can I send flowers to?
To any person you're in a position to be kind to.
wodner if you can pair with https://github.com/getumbrel/llama-gpt