| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by omneity 140 days ago

The model being 32B could run in <20GB VRAM with Q4 quantization (minimal loss of quality), or 80GB unquantized at full fidelity. The quoted 160GB is for their recommended evaluation settings.

There's a few pre-quantized options[0] or you can quantize it yourself with llama.cpp[1]. You can run the resulting gguf with llama.cpp `llama-cli` or `llama-server`, with LM Studio or with Ollama.

0: https://huggingface.co/models?search=cwm%20q4%20gguf

1: https://huggingface.co/spaces/ggml-org/gguf-my-repo

1 comments

chid 140 days ago

I see, still a fair more VRAM than I have access to. Thanks for sharing that information.

link