by moffkalast
844 days ago
It's mostly all just inference, though some people train LoRAs directly on quantized models too. GGML and GGUF are essentially the same format: GGUF is the newer version that stores more metadata about the model, which makes it easier to support multiple architectures, and it also includes prompt templates. These can run CPU-only, or be partially or fully offloaded to a GPU. With K-quants, you can get anywhere from a 2-bit to an 8-bit GGUF. GPTQ was the GPU-only optimized quantization method; it was superseded by AWQ, which is roughly 2x faster, and now by EXL2, which is even better. These are usually only 4-bit. Safetensors and PyTorch .bin files are the raw, unquantized model files (usually float16); these are only really used for continued fine-tuning.
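To put the bit widths above in perspective, here is a rough back-of-the-envelope sketch of how weight size scales with bits per weight. The 7B parameter count and the helper name `estimated_size_gb` are just assumptions for illustration. Real quantized files come out somewhat larger, since they also store scales and metadata and K-quants mix bit widths across tensors.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GB (10^9 bytes).

    Ignores quantization scales, metadata, and any tensors kept at
    higher precision, so treat the result as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter model at a few common bit widths.
for bits in (2, 4, 8, 16):
    print(f"{bits:>2}-bit: ~{estimated_size_gb(7e9, bits):.2f} GB")
```

By this estimate a 4-bit quant is about a quarter the size of the float16 original, which is roughly why those are the ones people actually run locally.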
|
That sounds very convenient. What software makes use of the built-in prompt template?