
Use Llama 2 or one of its various derivatives as the model. Get quantized models from https://huggingface.co/TheBloke

Oobabooga text-generation-webui for the server.

In the interface, use ExLlama for GPU-only inference (fast; for smaller models which fit in VRAM). Use Llama.cpp for larger models that don't fit in VRAM (higher fidelity but slower); it runs on CPU+GPU.

A 13B parameter 4-bit quantized model (type "GPTQ") can fit in a 12GB RTX 3060. A 24GB card (e.g. a 3090) is needed for a 30B model on GPU. Expect something like 5-10 tokens/sec.
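To sanity-check those numbers, here is a rough back-of-envelope sketch. It assumes weights take params x bits / 8 bytes, and the 1.2x overhead factor for KV cache and activations is my own guess rather than a measured figure. The function names are made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         overhead: float = 1.2) -> bool:
    """True if weights times an ASSUMED overhead factor fit in memory_gb.

    The overhead factor is a guess covering KV cache and activations;
    real usage depends on context length and the loader.
    """
    return weight_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    for size in (13, 30, 70):
        print(f"{size}B at 4-bit: ~{weight_gb(size, 4):.1f} GB of weights")
    print("13B on a 12GB card:", fits(13, 4, 12))
    print("30B on a 12GB card:", fits(30, 4, 12))
    print("30B on a 24GB card:", fits(30, 4, 24))
```

Under these assumptions a 4-bit 13B model is about 6.5 GB of weights, which leaves headroom on a 12GB card, while a 4-bit 30B model at about 15 GB only fits on the 24GB card.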

You can run 65B or 70B parameter models on CPU (e.g. an i7 12700) with 64GB RAM (you also need a decent GPU as above). Expect around 1 token/sec. These models are type "GGML" / "GGUF".
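For the CPU+GPU split, Llama.cpp lets you offload some number of layers to the GPU and keep the rest in system RAM. A rough sketch of how you might pick that number follows. It assumes all layers are the same size, ignores the KV cache and embedding weights, and uses a 2 GB VRAM reserve that is purely my guess. The 80-layer figure for a 70B model and the ~35 GB size (70 x 4 / 8) are assumptions too, and `gpu_layers` is a made-up helper name.

```python
import math


def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU.

    Assumes every layer takes an equal share of model_gb and keeps
    reserve_gb of VRAM free (an ASSUMED safety margin for KV cache,
    scratch buffers, and the desktop).
    """
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


if __name__ == "__main__":
    # ~35 GB for a 4-bit 70B model, assumed to have 80 layers.
    print("12GB card:", gpu_layers(35, 80, 12), "of 80 layers on GPU")
    print("24GB card:", gpu_layers(35, 80, 24), "of 80 layers on GPU")
```

Treat the result as a starting value for the GPU layers setting and lower it if you run out of VRAM.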

Long prompts take a long time for initial ingestion on CPU+GPU; ingestion is much faster on GPU only.

Llama.cpp also apparently runs very well on Apple silicon, where the memory shared between CPU and GPU is well-suited to this.