by p1esk 785 days ago
Of course. I thought Nvidia GPUs were pretty much a must-have to play with DL models.

Well, being able to run these models on CPU was pretty much the revolutionary part of llama.cpp.
I can run them on CPU: HF uses plain PyTorch code, which is fully supported on CPU.
But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), and it doesn't support features like CPU offloading.
Are there benchmarks? A 2x speedup would not be enough for me to return to C++ hell, but 5x might be, in some circumstances.
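In case it helps anyone compare backends themselves, here is a rough sketch of a timing harness. It is not a published benchmark: `generate` is a hypothetical callable you would wrap around whichever backend you are testing, and it is assumed to return the number of tokens it produced.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str, int], int],
                      prompt: str, max_tokens: int, runs: int = 3) -> float:
    """Call `generate` several times and return the best tokens/second seen.

    `generate` is a hypothetical wrapper around a backend; it must return
    the number of tokens it actually produced.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, produced / elapsed)
    return best


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return candidate_tps / baseline_tps


def fake_backend(prompt: str, max_tokens: int) -> int:
    """Stand-in backend for demonstration only: sleeps, then reports 10 tokens."""
    time.sleep(0.01)
    return 10


demo_tps = tokens_per_second(fake_backend, "hello", max_tokens=10)
print(f"fake backend: {demo_tps:.0f} tokens/s")
```

Taking the best of a few runs reduces noise from warm-up and caching. Use the same prompt and `max_tokens` for both backends, otherwise the ratio from `speedup` does not mean much.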
I think the biggest selling point of Ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.
Does Ollama/llama.cpp provide low-bit operations (AVX or CUDA kernels) to speed up inference? Or just model compression, with inference still done in fp16?
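To make the quality-versus-size trade-off concrete, here is a minimal pure-Python sketch of block-wise absmax quantization. It is not llama.cpp's actual format or code: the block size of 32 and the names are assumptions for illustration, and it covers only the storage side (quantize, then dequantize back to floats), not low-bit compute kernels.

```python
import random

BLOCK = 32  # assumed block size for this sketch


def quantize_block(values, bits=8):
    """Absmax-quantize one block to signed integers plus one float scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize(weights, bits=8, block=BLOCK):
    """Split the weights into blocks and quantize each block independently."""
    return [quantize_block(weights[i:i + block], bits)
            for i in range(0, len(weights), block)]


def dequantize(blocks):
    """Rebuild approximate float weights from (integers, scale) pairs."""
    out = []
    for q, scale in blocks:
        out.extend(x * scale for x in q)
    return out


def max_abs_error(a, b):
    """Largest element-wise difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
err8 = max_abs_error(weights, dequantize(quantize(weights, bits=8)))
err4 = max_abs_error(weights, dequantize(quantize(weights, bits=4)))
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

Each 8-bit value needs one byte instead of the two bytes of fp16, plus one scale per block, so storage roughly halves. At 4 bits it roughly halves again, at the cost of a larger rounding error.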

My understanding is that modern quantization algorithms are typically implemented in PyTorch.

There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python
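For anyone who wants to try the binding, a minimal sketch of a completion call might look like the following. The `complete` helper, the context size, and the model path are placeholders of mine. You need `pip install llama-cpp-python` and a GGUF model file on disk.

```python
def complete(model_path: str, prompt: str, max_tokens: int = 64) -> str:
    """Load a local GGUF model and return the completion text for `prompt`.

    `complete` is a hypothetical helper; `model_path` must point at a
    model file you have already downloaded.
    """
    # Imported lazily so this file can be loaded without the package installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]


# Example call (the path is a placeholder, not a real file):
# print(complete("/path/to/model.gguf", "Q: Name the planets. A:"))
```

Loading the model is the slow part, so for repeated calls you would probably keep the `Llama` object around rather than rebuild it each time.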
Ollama supports many Radeons now. And I guess llama.cpp does too; after all, it's what Ollama uses as its backend.
PyTorch (the underlying framework of HF) supports AMD as well, though I haven’t tried it.