i don't think such a guide exists. this space is moving pretty fast.

a short rundown of quantized model formats:

- GGML: used with llama.cpp. outdated, support has been dropped or will be soon. cpu+gpu inference
- GGUF: the "new version" of the GGML file format, used with llama.cpp. cpu+gpu inference. offers 2-8 bit quantization
- GPTQ: pure gpu inference, used with AutoGPTQ, exllama, exllamav2. in practice almost always 4-bit quantization (other bit widths exist but are rarely seen)
- EXL2: pure gpu inference, used with exllamav2. offers 2-8 bit quantization

here[1] is a nice overview of VRAM usage vs perplexity at different quant levels (using a 70b model in exl2 format as the example)

[1] https://old.reddit.com/r/LocalLLaMA/comments/178tzps/updated...
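
for a feel of why the quant level matters so much for VRAM, here is a rough back-of-the-envelope sketch. it only estimates the size of the weights themselves (parameters x bits per weight / 8) and ignores the KV cache, context length and runtime overhead, so real usage will be higher. the function name `weight_size_gb` and the bits-per-weight values in the loop are made up for illustration, they are not taken from the linked post or from any of the libraries above.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same average bit width.
    KV cache, activations and framework overhead are NOT included.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Illustrative bit widths only; 16.0 stands in for unquantized fp16.
for bpw in (2.5, 4.0, 6.0, 8.0, 16.0):
    print(f"70b @ {bpw:>4} bpw: ~{weight_size_gb(70e9, bpw):.1f} GB")
```

so a 70b model at 4 bits per weight needs roughly 35 GB for the weights alone, versus roughly 140 GB at fp16. the perplexity side of the trade-off can't be computed like this, it has to be measured, which is what the linked overview does.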