| HN Mirror



	by TheMatten 956 days ago
	I can reasonably run (quantized) Mistral-7B on a 16GB machine without a GPU, using ollama. Are you sure it isn't a configuration error or a bug?
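A rough back-of-the-envelope check of why a quantized 7B model can fit in 16GB, as a minimal sketch. The parameter count (about 7.2 billion) and the bit widths are assumptions for illustration, not figures from this thread, and the helper name `quantized_weights_gib` is hypothetical. It counts the weights only; real quantization formats store some extra scaling data, and the context cache and runtime need memory on top.

```python
def quantized_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed values (not from the thread): ~7.2e9 parameters for a "7B" model.
N_PARAMS = 7.2e9
RAM_GIB = 16

for bits in (16, 8, 4):
    size = quantized_weights_gib(N_PARAMS, bits)
    print(f"{bits:>2} bits/weight: ~{size:.1f} GiB of {RAM_GIB} GiB RAM")
```

Under these assumptions, the unquantized 16-bit weights would take most of the machine's memory, while a 4-bit quantization leaves plenty of headroom.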


ilaksh 956 days ago

How many tokens per second, and what are the specs of the machine? My CPU-only attempts have been really slow.


berkut 956 days ago

In my experience, llama.cpp on the CPU (on Linux) is very slow compared to running the same models on a GPU or NPU, as on my M1 MacBook Pro using Metal (or maybe it's the shared memory allowing the speedup?).

Even with 12 threads on my 5900X (I've tried using all 24 SMT threads, and that doesn't really seem to help) and the dolphin-2.5-mixtral-8x7b.Q5_K_M model, my MacBook Pro is around 5-6x faster in terms of tokens per second...


ilaksh 956 days ago

I think that Metal or something is actually a built-in graphics/matrix accelerator that those Macs have now. It's not really using the CPU, although it seems like Apple may be trying to market it a little bit as though it's just a powerful CPU. It's more like an accelerator integrated with the CPU.

But whatever it is, it's great, and I hope that Intel and AMD will catch up.

AMD has had APUs for a while, but I think they aren't at the same level at all as the new Mac acceleration.


stavros 956 days ago

There must be something wrong; my 3060 does double the tokens per second of my M2 Mac (with Metal).


TheMatten 956 days ago

Seems to be around 3 tokens/s on my laptop, which is faster than the average human, but not too fast of course. On a desktop with a mid-range GPU used for offloading, I can get around 12 tokens/s, which is plenty fast for chatting.
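For anyone wanting to compare numbers like these, here is a minimal sketch of how tokens/s can be computed. It assumes a response shaped like the final JSON from Ollama's generate endpoint, which as far as I know reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The helper name `tokens_per_second` is hypothetical and the sample values are made up for illustration, not measurements from this thread.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from a stats dict.

    Assumes `eval_count` is the number of generated tokens and
    `eval_duration` is the generation time in nanoseconds.
    """
    seconds = stats["eval_duration"] / 1e9
    return stats["eval_count"] / seconds


# Made-up sample values: 120 tokens generated in 40 seconds.
sample = {"eval_count": 120, "eval_duration": 40_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Note that this measures generation only; prompt processing time is reported separately and is not included here.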
