Are there any turnkey engines designed to run locally which can be trained on your own data? I've been itching to put my work into one, just to see what the results might be.
Use Oobabooga's text-generation-webui as the server.
In the interface, use ExLlama for GPU-only inference (fast, but only for smaller models that fit in VRAM). Use llama.cpp for larger models (higher fidelity but slower), split across CPU and GPU.
A 13B-parameter, 4-bit quantized model (type "GPTQ") can fit in a 12GB RTX 3060. A 24GB card (e.g. a 3090) is needed for a 30B model on GPU. Expect something like 5-10 tokens/sec.
You can run 65B or 70B parameter models on CPU (e.g. an i7-12700) with 64GB RAM (you also need a decent GPU, as above). Expect around 1 token/sec. These models are type "GGML" / "GGUF".
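To see why those model sizes pair with those cards, here is a rough back-of-the-envelope sketch. It assumes exactly 4 bits per weight and a flat 2 GB allowance for context cache and runtime overhead. Both numbers are my own assumptions, not measurements, and the helper names are made up for illustration. Real GPTQ/GGUF files are somewhat larger than this because they also store quantization scales.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits each / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, memory_gb: float,
         bits_per_weight: float = 4.0, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in memory_gb."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    # (model size in billions of parameters, available memory in GB)
    for size, mem in [(13, 12), (30, 12), (30, 24), (65, 24), (65, 64), (70, 64)]:
        verdict = "fits" if fits(size, mem) else "does not fit"
        print(f"{size}B @ 4-bit ~ {weight_gb(size):.1f} GB of weights: "
              f"{verdict} in {mem} GB")
```

By this estimate a 13B model is about 6.5 GB of weights, a 30B model about 15 GB, and a 65B model about 32.5 GB, which lines up with the 12GB card / 24GB card / 64GB RAM split described above.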
Long prompts take a long time for initial ingestion on CPU+GPU; ingestion is much faster on GPU only.
Llama.cpp also apparently runs very well on Apple silicon, since the memory shared between CPU and GPU suits it well.
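For the CPU+GPU split with llama.cpp, the knob you end up setting is how many layers to offload to the GPU (llama.cpp's `--n-gpu-layers` option). Below is a rough sketch for guessing a starting value. It assumes the weights are spread evenly across layers, 4 bits per weight, and 2 GB of VRAM held back for context and scratch buffers. Those are all my assumptions, and `gpu_layers` is a hypothetical helper, not part of any library. You supply the layer count yourself; 80 for a 65B LLaMA and 40 for a 13B are what I use in the example.

```python
def gpu_layers(params_billion: float, n_layers: int, vram_gb: float,
               bits_per_weight: float = 4.0, reserve_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer is the same size (ignores embeddings and the
    output head) and keeps reserve_gb of VRAM free for everything else.
    """
    total_gb = params_billion * bits_per_weight / 8   # quantized weights, GB
    per_layer_gb = total_gb / n_layers                # even split per layer
    usable_gb = max(vram_gb - reserve_gb, 0.0)        # never go negative
    return min(n_layers, int(usable_gb / per_layer_gb))


if __name__ == "__main__":
    print("65B on a 12GB card:", gpu_layers(65, 80, 12), "of 80 layers")
    print("65B on a 24GB card:", gpu_layers(65, 80, 24), "of 80 layers")
    print("13B on a 12GB card:", gpu_layers(13, 40, 12), "of 40 layers")
```

Under these assumptions a 12GB card takes roughly 24 of the 80 layers of a 65B model, with the rest staying on the CPU, which is consistent with that setup being much slower than a model that sits entirely in VRAM. Treat the result as a first guess and lower it if you run out of memory.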