Download llama.cpp.

Convert the fine-tuned model into GGUF format. Choose a number of quantization bits such that the final GGUF will fit in your free RAM + VRAM.

Run the llama.cpp server binary. Set -ngl (the number of layers offloaded to the GPU) to the largest number that will not overflow your VRAM. (I just determine it experimentally: I start with the full number of layers, divide by two if it runs out of VRAM, multiply by 1.5 if there is enough VRAM, etc.)

Make sure to set the temperature to 0 if you are doing fact-based language conversion and not creative tasks.

If it's too slow, get more VRAM.

Ollama, kobold.cpp, and just running the model yourself with a Python script as described by the original commenter are also options, but the above is what I have been enjoying lately.

Everyone else in this thread is saying you need GPUs, but this really isn't true. What you need is RAM. If you are trying to get a model that can reason, you really want the biggest model possible. The more RAM you have, the less you have to quantize your production model. If you can batch your requests and get the result a day later, you just need as much RAM as you can get, and it doesn't matter how many tokens per second you get. If you are doing creative generation, this doesn't matter nearly as much. If you need realtime, it gets extremely expensive fast to get enough VRAM to host your whole model (assuming you want as large a model as possible for better reasoning capability).
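
For what it's worth, the two sizing steps in the comment above (picking quantization bits that fit in RAM + VRAM, and hunting for -ngl by trial and error) can be sketched roughly in Python. This is only an illustration under loud assumptions: the size estimate is the back-of-the-envelope "parameters x bits / 8" and ignores metadata, context/KV cache, and the fact that real GGUF quant types use fractional bits per weight; the function names, the 0.9 headroom factor, and the `fits` callback are all hypothetical (in practice `fits` would mean "start the server with that -ngl and see whether it runs out of VRAM"). The clamping between known-good and known-bad values is my addition so the halve / multiply-by-1.5 loop is guaranteed to stop.

```python
from typing import Callable, Optional, Sequence

GIB = 2**30


def estimate_gguf_gib(n_params: float, bits_per_weight: float) -> float:
    """Very rough model size in GiB: parameters * bits / 8 bytes.

    Assumption: ignores file metadata, KV cache, and per-quant overhead.
    """
    return n_params * bits_per_weight / 8 / GIB


def pick_bits(
    n_params: float,
    free_ram_gib: float,
    free_vram_gib: float,
    candidates: Sequence[int] = (16, 8, 6, 5, 4, 3, 2),
    headroom: float = 0.9,  # hypothetical safety margin, not from the comment
) -> Optional[int]:
    """Return the largest candidate bit width whose estimate fits in RAM + VRAM."""
    budget = (free_ram_gib + free_vram_gib) * headroom
    for bits in candidates:  # candidates are ordered from most to fewest bits
        if estimate_gguf_gib(n_params, bits) <= budget:
            return bits
    return None  # nothing fits, even at the lowest bit width


def find_ngl(total_layers: int, fits: Callable[[int], bool]) -> int:
    """Search for the largest -ngl value that fits in VRAM.

    Follows the heuristic from the comment: start at the full layer count,
    halve on failure, multiply by 1.5 on success. Each proposal is clamped
    to lie strictly between the best known-good value and the smallest
    known-bad value, so the loop always terminates.

    `fits(ngl)` is a hypothetical callback that returns True when the model
    loads with that many offloaded layers without overflowing VRAM.
    """
    best = 0                      # largest value known to fit
    too_big = total_layers + 1    # smallest value known (or assumed) to overflow
    ngl = total_layers
    while best < ngl < too_big:
        if fits(ngl):
            best = ngl
            proposal = int(ngl * 1.5)
        else:
            too_big = ngl
            proposal = ngl // 2
        ngl = min(max(proposal, best + 1), too_big - 1)
    return best
```

With these made-up numbers, `pick_bits(70e9, 32, 24)` returns 6 (a 70B-parameter model, 32 GiB free RAM, 24 GiB free VRAM), and `find_ngl(32, lambda n: n <= 20)` returns 20 when the pretend VRAM limit is 20 layers. Treat both as starting points to check against what the server actually does on your machine.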