


	by tyfon 978 days ago
	I have 128 GB of RAM in my computer with a 5850x. It lets me load and run the 180B Falcon and 70B Llama 2 LLMs in llama.cpp, although at different quantization levels. Speed is actually not that bad either.
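
	A rough way to see why the two models need different quantization to fit in 128 GB is to estimate the size of the weights alone. This is only a sketch: the function name is made up for illustration, the bit widths are round numbers rather than actual llama.cpp quantization formats (which use slightly more bits per weight), and the KV cache and OS overhead are ignored.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (KV cache, activations, OS) is counted.
    """
    return n_params_billion * bits_per_weight / 8


RAM_GB = 128  # the amount of RAM mentioned in the post

for name, params_b in [("Falcon 180B", 180), ("Llama 2 70B", 70)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params_b, bits)
        verdict = "fits" if size < RAM_GB else "does not fit"
        print(f"{name} at {bits}-bit: ~{size:.0f} GB, {verdict} in {RAM_GB} GB")
```

	Under these assumptions the 180B model only fits at around 4 bits per weight (about 90 GB), while the 70B model also fits at 8 bits (about 70 GB).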

2 comments

mosselman 978 days ago

Is there some documentation on how to run this setup?

How fast is your setup?


rnk 978 days ago

I'm doing this on a Mac Studio with 128 GB too. I'm using llama.cpp.


acchow 978 days ago

Since you get GPU acceleration (because of the unified memory), I imagine this is probably much faster than the PC setup?

Edit: It seems some people are getting 1-2.6 tokens/sec on Ryzen (no GPU acceleration) with quantized Llama 70B: https://www.reddit.com/r/LocalLLaMA/comments/15rqkuw/llama_2...

Whereas a Mac Studio gets 13 tokens/sec: https://blog.gopenai.com/how-to-deploy-llama-2-as-api-on-mac...


stoatmagoats 977 days ago

Friendly internet stranger’s input:

- you don’t get GPU acceleration just by using unified memory. Llama.cpp still only uses the CPU on Apple Silicon chips.

- the difference in tokens/sec is likely attributable to memory bandwidth. Mac Studios with the base Max chip have 400 GB/s memory bandwidth compared to around 50 GB/s for the Ryzen 5000 series CPUs
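
To put rough numbers on the bandwidth point: when generating one token at a time, every weight has to be read from memory once per token, so memory bandwidth divided by model size gives an approximate upper bound on tokens/sec. The sketch below assumes a 70B model at 4 bits per weight (about 35 GB) and uses the bandwidth figures from the bullet above. The function name and the 35 GB figure are assumptions for illustration, not measurements.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream generation speed.

    Assumption: each generated token requires reading all weights once,
    and compute, caches, and the KV cache are ignored.
    """
    return bandwidth_gb_s / model_size_gb


MODEL_GB = 35.0  # assumption: 70B parameters at 4 bits per weight

for label, bandwidth in [
    ("Mac Studio, Max chip (400 GB/s)", 400.0),
    ("Ryzen 5000 series (~50 GB/s)", 50.0),
]:
    ceiling = max_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{label}: at most ~{ceiling:.1f} tokens/sec")
```

This gives a ceiling of roughly 11 tokens/sec for 400 GB/s and roughly 1.4 tokens/sec for 50 GB/s. The figures quoted upthread may come from different chips or quantization levels, so only the order of magnitude is comparable.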


spott 977 days ago

Llama.cpp defaults to using Metal. [0]

[0] https://github.com/ggerganov/llama.cpp#metal-build


acchow 978 days ago

What's your generation speed?
