| HN Mirror

Well, I'm not sure which models specifically work, but it runs on llama.cpp, which would mean LLaMA-derivative ones. Here's a little table of quantized CPU (GGML) versions and the RAM they require, as a general rule of thumb:

| Name | Quant method | Bits | Size | RAM required | Use case |
|---|---|---|---|---|---|
| WizardLM-7B.GGML.q4_0.bin | q4_0 | 4bit | 4.2GB | 6GB | 4-bit. |
| WizardLM-7B.GGML.q4_1.bin | q4_1 | 4bit | 4.63GB | 6GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| WizardLM-7B.GGML.q5_0.bin | q5_0 | 5bit | 4.63GB | 7GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
| WizardLM-7B.GGML.q5_1.bin | q5_1 | 5bit | 5.0GB | 7GB | 5-bit. Even higher accuracy, and higher resource usage and slower inference. |
| WizardLM-7B.GGML.q8_0.bin | q8_0 | 8bit | 8GB | 10GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

| Name | Quant method | Bits | Size | RAM required | Use case |
|---|---|---|---|---|---|
| wizard-vicuna-13B.ggmlv3.q4_0.bin | q4_0 | 4bit | 8.14GB | 10.5GB | 4-bit. |
| wizard-vicuna-13B.ggmlv3.q4_1.bin | q4_1 | 4bit | 8.95GB | 11.0GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| wizard-vicuna-13B.ggmlv3.q5_0.bin | q5_0 | 5bit | 8.95GB | 11.0GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
| wizard-vicuna-13B.ggmlv3.q5_1.bin | q5_1 | 5bit | 9.76GB | 12.25GB | 5-bit. Even higher accuracy, and higher resource usage and slower inference. |
| wizard-vicuna-13B.ggmlv3.q8_0.bin | q8_0 | 8bit | 16GB | 18GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

| Name | Quant method | Bits | Size | RAM required | Use case |
|---|---|---|---|---|---|
| VicUnlocked-30B-LoRA.ggmlv3.q4_0.bin | q4_0 | 4bit | 20.3GB | 23GB | 4-bit. |
| VicUnlocked-30B-LoRA.ggmlv3.q4_1.bin | q4_1 | 4bit | 24.4GB | 27GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| VicUnlocked-30B-LoRA.ggmlv3.q5_0.bin | q5_0 | 5bit | 22.4GB | 25GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
| VicUnlocked-30B-LoRA.ggmlv3.q5_1.bin | q5_1 | 5bit | 24.4GB | 27GB | 5-bit. Even higher accuracy, and higher resource usage and slower inference. |
| VicUnlocked-30B-LoRA.ggmlv3.q8_0.bin | q8_0 | 8bit | 36.6GB | 39GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

Copied from some of TheBloke's model descriptions on Hugging Face. With 16GB of RAM you can run practically all 7B and 13B versions. With shared GPU+CPU inference, you can also offload some layers onto a GPU (not sure if that makes the initial RAM requirement smaller), but you do need CUDA of course.
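If you want to check a different RAM budget against the numbers above, here's a rough Python sketch. The sizes and RAM figures are just copied from the table; the short labels, the function names and the 2.5GB overhead in `estimate_ram` are my own assumptions (the quoted RAM figures look like file size plus roughly 2 to 3GB), not anything official from llama.cpp or the model cards.

```python
# (file size in GB, quoted RAM requirement in GB), copied from the table above.
# The short labels are made up for this sketch.
MODELS = {
    "WizardLM-7B q4_0": (4.2, 6.0),
    "WizardLM-7B q4_1": (4.63, 6.0),
    "WizardLM-7B q5_0": (4.63, 7.0),
    "WizardLM-7B q5_1": (5.0, 7.0),
    "WizardLM-7B q8_0": (8.0, 10.0),
    "wizard-vicuna-13B q4_0": (8.14, 10.5),
    "wizard-vicuna-13B q4_1": (8.95, 11.0),
    "wizard-vicuna-13B q5_0": (8.95, 11.0),
    "wizard-vicuna-13B q5_1": (9.76, 12.25),
    "wizard-vicuna-13B q8_0": (16.0, 18.0),
    "VicUnlocked-30B q4_0": (20.3, 23.0),
    "VicUnlocked-30B q4_1": (24.4, 27.0),
    "VicUnlocked-30B q5_0": (22.4, 25.0),
    "VicUnlocked-30B q5_1": (24.4, 27.0),
    "VicUnlocked-30B q8_0": (36.6, 39.0),
}


def fits(ram_gb):
    """Return the models whose quoted RAM requirement is within ram_gb."""
    return [name for name, (_, ram) in MODELS.items() if ram <= ram_gb]


def estimate_ram(size_gb, overhead_gb=2.5):
    """Guess RAM for a file not in the table: size plus an assumed overhead."""
    return size_gb + overhead_gb


if __name__ == "__main__":
    for name in fits(16):
        print(name)
```

With a 16GB budget this lists every 7B file and every 13B file except q8_0, which is where the "practically all" comes from. It ignores whatever else is using memory on the machine, so treat the result as an upper bound.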