| HN Mirror


	by Smith42 1197 days ago
	Since this is PyTorch, it should run on CPU anyway. What am I missing?

2 comments

progman32 1197 days ago

Reading the patch: https://github.com/facebookresearch/llama/compare/main...mar...

Looks like this just tweaks some defaults and comments out some code that enables CUDA. It also switches to something called gloo, which I'm not familiar with. It seems to be an alternate backend.

markasoftware 1197 days ago

You don't actually need to switch to gloo; I just have no idea what I'm doing.

refulgentis 1197 days ago

Lol, all my best work has been when I don’t know what I’m doing, and it’s refreshing to see someone moving the ball forward and feeling the same way. Kudos.

rajman187 1197 days ago

Gloo is a collective communications library for distributed computation (think along the lines of MPI).
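
For anyone else who hadn't met gloo before, here is a minimal sketch of selecting it as the torch.distributed backend for a single CPU-only process. The helper names (rendezvous_env, run_single_process), the loopback address, and port 29500 are assumptions made up for illustration; none of them come from the patch.

```python
import os


def rendezvous_env(rank: int = 0, world_size: int = 1, port: int = 29500) -> dict:
    """Variables read by torch.distributed's default env:// rendezvous.

    The address and port are illustrative assumptions for a one-machine run.
    """
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def run_single_process() -> float:
    """Start a one-process gloo group, do a trivial all-reduce, and clean up."""
    # Imported lazily so rendezvous_env() can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    os.environ.update(rendezvous_env())
    dist.init_process_group(backend="gloo")  # gloo runs on plain CPUs
    try:
        t = torch.ones(1)
        dist.all_reduce(t)  # sums across ranks; with one rank the value stays 1.0
        return t.item()
    finally:
        dist.destroy_process_group()
```

With PyTorch installed, calling run_single_process() should set up and tear down the group without needing a GPU. Whether the code being patched needs a process group at all on one machine is a separate question this sketch does not answer.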

Zetobal 1197 days ago

I guess the simple fact that it didn't before his patch?

cinntaile 1197 days ago

Usually you can trivially make a model run on CPU or GPU by writing .cpu() at specific places, so he's wondering why that isn't the case here.

markasoftware 1197 days ago

That's literally all I did (plus switching the tensor type). I'd imagine people are posting and upvoting this not because the code is actually interesting, but because it runs unexpectedly fast on consumer CPUs, which is something they hadn't considered feasible before.

roenxi 1197 days ago

That is vastly underestimating how tricky it is to make novel pieces of software run. There is a huge fringe of people who know how to click things but not use the terminal and a large fringe of people who know how to run "./execute.bat" but not how to write syntactically correct Python.

But a lot of those people want to play with LLMs.

ComplexSystems 1197 days ago

How are you getting this to run fast? I'm on a top-of-the-line M1 MBP and getting 1 token every 8 minutes.

ingenieroariel 1197 days ago

Try switching all the .cuda() calls to .to("mps"). I got a 100x speedup on a different language model on a MacBook M1 Air.

https://pytorch.org/docs/stable/notes/mps.html

singularity2001 1197 days ago

dedicated fork: https://github.com/remixer-dec/llama-mps

markasoftware 1197 days ago

PyTorch is probably heavily optimized for x86; it's likely using lots of SIMD and whatnot. I'm sure it's possible to get similar performance on M1 Macs, but not with the current version of PyTorch.

Do you have enough RAM (that is, is it running without swapping to disk)?
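
A rough way to answer the RAM question is to multiply the parameter count by the bytes per parameter. The sketch below counts weights only (no activations or other overhead), and the 7-billion-parameter figure is a hypothetical model size chosen for illustration, not a number from this thread.

```python
def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


# Hypothetical 7-billion-parameter model; the size is an assumption.
N_PARAMS = 7_000_000_000
fp16 = weights_gib(N_PARAMS, 2)  # 16-bit floats
fp32 = weights_gib(N_PARAMS, 4)  # 32-bit floats, exactly twice the fp16 figure
print(f"fp16: {fp16:.1f} GiB, fp32: {fp32:.1f} GiB")
```

If the result is close to or above the machine's physical memory, swapping would be a plausible explanation for minutes-per-token speeds.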

jwitthuhn 1196 days ago

Same experience for me; it looks like it's only using one CPU core instead of all of them.
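
One thing that might be worth checking here is PyTorch's intra-op thread count. The sketch below reads it and raises it to the number of logical cores; target_threads and raise_torch_threads are hypothetical helper names, and there is no guarantee that the thread setting is what limits the M1 runs described above.

```python
import os


def target_threads(reserve: int = 0) -> int:
    """Logical core count minus `reserve`, never below 1."""
    return max(1, (os.cpu_count() or 1) - reserve)


def raise_torch_threads() -> tuple:
    """Return PyTorch's intra-op thread count before and after raising it."""
    import torch  # lazy import: target_threads() works without PyTorch

    before = torch.get_num_threads()
    torch.set_num_threads(target_threads())
    return before, torch.get_num_threads()
```

If the "before" value already matches the core count, the single busy core probably has another cause.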

sva_ 1197 days ago

Or better yet, define device = 'cpu' and use tensor.to(device).
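
A minimal sketch of that device pattern, extended to cover the CUDA and MPS cases mentioned upthread. The names choose_device and demo are assumptions for illustration, and the tiny Linear layer merely stands in for a real model.

```python
def choose_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Pick a device string, preferring CUDA, then Apple's MPS, then the CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def demo():
    """Move a toy model and its input to whichever device is available."""
    # Imported lazily so choose_device() stays usable without PyTorch installed.
    import torch

    device = choose_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
    model = torch.nn.Linear(4, 2).to(device)  # move the parameters once
    x = torch.randn(3, 4).to(device)          # inputs must be on the same device
    return model(x).shape
```

Keeping the device in a single variable means that switching hardware is a one-line change, rather than a hunt for every .cuda() call.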

tmalsburg2 1197 days ago

If someone else wrote this comment, would you find it useful?
