| HN Mirror

Technically anything that's based on pytorch can run on CPU, you just need to tell it to do so. For example, in textgen add '--cpu' and you're done. It will be super slow though.

GGML format is meant to be executed through llama.cpp, which doesn't use GPU by default. You can often find these models in a quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, look for 5_1 for the best quality right now (well, among quantized models).

Oh yeah, textgen supports llama.cpp, and also provides API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:

pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h... pip install -U llama-cpp-python