by horsawlarway 1071 days ago
I started with llama.cpp: https://github.com/ggerganov/llama.cpp

It won't run everything, but it will run models in the GGML format, such as https://huggingface.co/TheBloke/llama-65B-GGML

The steps are basically:

1. Download a model

2. Make sure you have the latest NVIDIA driver for your machine, along with the CUDA toolkit. The install process varies by OS but is fairly easy on most Linux distros.

3. Compile https://github.com/ggerganov/llama.cpp following their instructions (in particular, look for LLAMA_CUBLAS, which enables GPU support)

4. Run the model following their instructions. There are several flags that are important, but you can also just use their server example that was added a few days ago - it gives a fairly solid chat interface.
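
For what it's worth, here is a rough sketch of what steps 3 and 4 looked like for me, written as a small Python helper that only builds and prints the commands (it doesn't run anything). Treat the details as assumptions: the model filename is a made-up placeholder, the `make LLAMA_CUBLAS=1` switch and the `-m` / `-ngl` flags are as I remember them from the README at the time, and `llama_cpp_commands` is just an illustrative name. Check the current README before copying any of it.

```python
import shlex


def llama_cpp_commands(model_path, gpu_layers=32, use_server=True):
    """Return the commands for steps 3 and 4 as argv lists (nothing is executed)."""
    build = [
        ["git", "clone", "https://github.com/ggerganov/llama.cpp"],
        # LLAMA_CUBLAS=1 is the build switch from step 3 (assumed Makefile build).
        ["make", "-C", "llama.cpp", "LLAMA_CUBLAS=1"],
    ]
    # Either the server example (chat interface) or the plain command-line binary.
    binary = "./llama.cpp/server" if use_server else "./llama.cpp/main"
    # -m selects the model file; -ngl is the number of layers to offload to the GPU.
    run = [binary, "-m", model_path, "-ngl", str(gpu_layers)]
    if not use_server:
        run += ["-p", "Hello"]  # one-shot prompt for the plain binary
    return build + [run]


if __name__ == "__main__":
    # Hypothetical model filename; use whatever file you downloaded in step 1.
    for argv in llama_cpp_commands("models/llama-65b.ggmlv3.q4_0.bin"):
        print(shlex.join(argv))
```

How many layers you can offload depends on your VRAM, so the default of 32 here is only a starting guess.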