| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by moffkalast 890 days ago

It's all mostly just inference, though some train LoRAs directly on quantized models too.

GGML and GGUF are the same thing, GGUF is the new version that adds more data about the model so it's easy to support multiple architectures, and also includes prompt templates. These can run CPU only, be partially or fully offloaded to a GPU. With K quants, you can get anywhere from a 2 bit to an 8 bit GGUF.

GPTQ was the GPU-only optimized quantization method that was superseded by AWQ, which is roughly 2x faster and now by EXL2 which is even better. These are usually only 4 bit.

Safetensors and pytorch bin files are raw float16 model files, these are only really used for continued fine tuning.

1 comments

Gracana 890 days ago

> and also includes prompt templates

That sounds very convenient. What software makes use of the built-in prompt template?

link

moffkalast 890 days ago

Of the ones I commonly use, I've only seen it read by text-generation-webui, in the GGML days it had a long hardcoded list of known models and which templates they use so they could be auto-selected (which was often wrong), but now it just grabs it from any model directly and sets it when it's loaded.

link