by throwaway19423 842 days ago
Can any kind soul explain the difference between GGUF, GGML and all the other model packaging formats I am seeing these days? I was used to .pth and whatever format TF uses. Is this all to support inference or quantization? Who manages these formats, or are they evolving organically?
3 comments

I think it's mostly an organic process arising from the ecosystem.

My personal way of understanding it is this - the original sin of model weight format complexity is that NNs are both data and computation.

Representing the computation as data is the hard part, and that's where the simplicity falls apart. Do you embed the compute graph? If so, what do you do about different frameworks supporting overlapping but distinct operations? Do you need the artifact to make training reproducible? Well, that's an even more complex computation that you have to serialize as data. And so on.
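To make the "overlapping but distinct operations" problem concrete, here is a toy sketch in Python. The graph encoding, the op names and both "runtimes" are made up for illustration and do not correspond to any real framework's format.

```python
import math

# A compute graph serialized as plain data: each node names an op,
# its inputs and its output. This encoding is hypothetical.
GRAPH = [
    {"op": "mul", "inputs": ["x", "w"], "output": "h"},
    {"op": "add", "inputs": ["h", "b"], "output": "z"},
    {"op": "gelu", "inputs": ["z"], "output": "y"},
]

# Two imaginary runtimes with overlapping but distinct op sets.
RUNTIME_A = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "gelu": lambda v: 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))),
}
RUNTIME_B = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "relu": lambda v: max(v, 0.0),  # has relu, but no gelu
}


def run(graph, ops, inputs):
    """Interpret the graph node by node using the ops a runtime provides."""
    values = dict(inputs)
    for node in graph:
        fn = ops.get(node["op"])
        if fn is None:
            raise NotImplementedError(f"runtime has no op {node['op']!r}")
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = fn(*args)
    return values


inputs = {"x": 2.0, "w": 3.0, "b": -6.0}
result_a = run(GRAPH, RUNTIME_A, inputs)  # works: every op is supported
```

Running the same `GRAPH` against `RUNTIME_B` raises `NotImplementedError` at the `gelu` node, which is the portability problem in miniature: the file loads fine, but the computation it describes does not exist on the other side.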

It's mostly all for inference, though some people also train LoRAs directly on quantized models.

GGML and GGUF are essentially the same thing. GGUF is the newer version, which adds more metadata about the model so that it's easier to support multiple architectures, and it also includes prompt templates. These models can run on CPU only, or be partially or fully offloaded to a GPU. With K quants, you can get anything from a 2 bit to an 8 bit GGUF.
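For a rough idea of what that extra metadata looks like on disk, here is a minimal Python sketch that builds and parses a toy GGUF-style header in memory. The layout (a `GGUF` magic, a little-endian version, tensor and key/value counts, then length-prefixed string pairs), the string type code 8 and the key names are my recollection of the v3 spec, so treat them as assumptions and check the spec before relying on them. It only handles string values and writes no tensors.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # assumed value-type code for strings


def _pack_string(text):
    """Length-prefixed UTF-8 string: uint64 length, then the bytes."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def build_toy_header(metadata):
    """Build magic + version + counts + string-only key/value pairs."""
    out = GGUF_MAGIC + struct.pack("<IQQ", 3, 0, len(metadata))
    for key, value in metadata.items():
        out += _pack_string(key)
        out += struct.pack("<I", GGUF_TYPE_STRING)
        out += _pack_string(value)
    return out


def _read_string(buf, pos):
    (length,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    return buf[pos:pos + length].decode("utf-8"), pos + length


def parse_toy_header(buf):
    """Parse the toy header; only string values are supported here."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    pos = 4 + struct.calcsize("<IQQ")
    metadata = {}
    for _ in range(kv_count):
        key, pos = _read_string(buf, pos)
        (value_type,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if value_type != GGUF_TYPE_STRING:
            raise NotImplementedError("sketch only handles string values")
        metadata[key], pos = _read_string(buf, pos)
    return {"version": version, "tensor_count": tensor_count, "metadata": metadata}


blob = build_toy_header({
    "general.architecture": "llama",          # assumed key name
    "tokenizer.chat_template": "{{ messages }}",  # assumed key name
})
info = parse_toy_header(blob)
```

The point is that a loader can read the architecture and the prompt template straight from the file with nothing more than fixed-width integer reads, instead of guessing them from the file name.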

GPTQ was the GPU-only quantization method. It was superseded by AWQ, which is roughly 2x faster, and now by EXL2, which is even better. These are usually 4 bit only.

Safetensors and PyTorch .bin files are raw float16 model files. These are only really used for continued fine-tuning.

> and also includes prompt templates

That sounds very convenient. What software makes use of the built-in prompt template?

Of the tools I commonly use, I've only seen it read by text-generation-webui. In the GGML days it had a long hardcoded list of known models and the templates they use so that one could be auto-selected (which was often wrong), but now it just reads the template directly from the model and applies it when the model is loaded.

A .pth file is a Python pickle, so it can include Python code (PyTorch code) for inference. TF includes the complete static graph.
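As a small illustration of the "can include Python code" part, here is a sketch using only the standard `pickle` module, with no PyTorch involved. The `Payload` class is hypothetical, and the callable is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle tells the loader what to call."""

    def __reduce__(self):
        # On load, the unpickler calls int("42") and returns the result.
        # Any importable callable could be named here instead of int.
        return (int, ("42",))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)  # runs the callable chosen by the file's author
```

After loading, `loaded` is the integer returned by the call, not a `Payload` instance. The file decided what code ran at load time, which is why pickle-based checkpoints from untrusted sources deserve caution.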

GGUF is just weights, and safetensors is the same. GGUF doesn't need a JSON decoder to read the format, while safetensors does.
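For reference, this is roughly where the JSON comes in. As far as I know, a safetensors file is an 8-byte little-endian header length, a JSON header describing each tensor (`dtype`, `shape`, `data_offsets`), and then the raw bytes. Below is a stdlib-only Python sketch that builds and reads such a layout in memory. The helper names are made up, and this is not the `safetensors` library API.

```python
import json
import struct


def build_toy_file(tensors):
    """tensors maps name -> (dtype string, shape list, raw bytes)."""
    header = {}
    payload = b""
    for name, (dtype, shape, raw) in tensors.items():
        begin = len(payload)
        payload += raw
        # Offsets are relative to the start of the data section.
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [begin, len(payload)]}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return the decoded JSON header and the offset where data starts."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


def read_tensor_bytes(blob, name):
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]


# Four float16 values as raw little-endian bytes ("e" is half precision).
weights = struct.pack("<4e", 1.0, 2.0, 3.0, 4.0)
blob = build_toy_file({"layer0.weight": ("F16", [2, 2], weights)})
header, data_start = read_header(blob)
values = struct.unpack("<4e", read_tensor_bytes(blob, "layer0.weight"))
```

So the JSON decoder is only needed for the small header; the tensor data itself is sliced out by byte offset.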

I personally think needing a JSON decoder is not a big deal, and it makes the format more amenable to change, given that GGUF keeps evolving too.