by HPsquared
975 days ago
Llama 2 and its various derivatives as the model. Get quantized models from https://huggingface.co/TheBloke

Oobabooga text-generation-webui for the server. In the interface, use ExLlama for GPU-only inference (fast; for smaller models that fit in VRAM), or Llama.cpp for larger models (higher fidelity but slower), split across CPU+GPU.

A 13B-parameter 4-bit quantized model (type "GPTQ") can fit in a 12GB RTX 3060. A 24GB card (e.g. a 3090) is needed for a 30B model on GPU. Something like 5-10 tokens/sec.

Can run 65B or 70B parameter models on CPU (e.g. i7 12700) with 64GB RAM (also need a decent GPU as above). Around 1 token/sec. These models are type "GGML" / "GGUF".

Long prompts take a long time for initial ingestion on CPU+GPU, much faster on GPU only.

Llama.cpp also apparently runs very well on Apple silicon, with the shared memory between CPU and GPU being well suited to it.
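For a rough sanity check on those sizes, here is a back-of-envelope sketch in Python. It only counts the quantized weights at a nominal 4 bits each, then adds a flat allowance for context cache and runtime buffers. The 2 GiB allowance, the function names, and treating the advertised "GB" figures as GiB are my own assumptions, not measured numbers, and real quantization formats carry some extra per-weight overhead.

```python
def weight_gib(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, memory_gib: float,
         bits_per_weight: float = 4.0, overhead_gib: float = 2.0) -> bool:
    """True if weights plus a flat overhead allowance fit in memory_gib.

    overhead_gib is an assumed placeholder for context cache and
    runtime buffers; it grows with context length in practice.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= memory_gib


if __name__ == "__main__":
    # (model size in billions of parameters, memory available in GiB)
    cases = [(13, 12), (30, 12), (30, 24), (70, 24), (70, 64)]
    for params, memory in cases:
        size = weight_gib(params)
        verdict = "fits" if fits(params, memory) else "does not fit"
        print(f"{params}B at 4-bit: ~{size:.1f} GiB of weights, {verdict} in {memory} GiB")
```

This lines up with the numbers above: 13B fits a 12GB card, 30B wants a 24GB card, and 70B has to go to system RAM. Treat it as an estimate only; long contexts push the overhead well past a flat allowance.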