| HN Mirror


by turmeric_root 1245 days ago

> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

It's sharded across all 4 GPUs (as per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now people are just going to be throwing pytorch code at the wall and seeing what sticks.

q1w2 1245 days ago

> people are just going to be throwing pytorch code at the wall

The pytorch 2.0 nightly has a number of performance enhancements as well as ways to reduce the memory footprint needed.

But also, looking at the README, it appears that the model alone needs 2 bytes per parameter, e.g. 65B needs 130GB of VRAM, PLUS the decoding cache, which stores 2 * 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim bytes = 17GB for the 7B model (not sure if it needs to increase for the 65B model), so maybe 147GB of VRAM in total for the 65B model.
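For anyone who wants to check the cache arithmetic, here is a rough sketch of the formula quoted above. The hyperparameters are assumptions, not something stated in this thread: 32 layers, 32 heads and head_dim 128 for 7B, 80 layers, 64 heads and head_dim 128 for 65B (as I recall them from the LLaMA paper), plus max_batch_size=32 and max_seq_len=1024 as assumed defaults. Reading the leading `2 * 2` as "keys and values, 2 bytes each" is also my interpretation.

```python
def decoding_cache_bytes(n_layers, n_heads, head_dim,
                         max_batch_size=32, max_seq_len=1024,
                         bytes_per_value=2):
    """Size of the decoding (key/value) cache using the formula quoted above.

    Defaults for max_batch_size and max_seq_len are assumed values,
    not confirmed settings from the repository.
    """
    # 2 cached tensors (keys and values) * bytes per value, for every
    # layer, batch slot, sequence position, head and head dimension.
    return (2 * bytes_per_value * n_layers * max_batch_size
            * max_seq_len * n_heads * head_dim)


# Assumed model shapes (hypothetical inputs for illustration).
cache_7b = decoding_cache_bytes(n_layers=32, n_heads=32, head_dim=128)
cache_65b = decoding_cache_bytes(n_layers=80, n_heads=64, head_dim=128)

print(f"7B cache:  {cache_7b / 1e9:.1f} GB")   # 17.2 GB
print(f"65B cache: {cache_65b / 1e9:.1f} GB")  # 85.9 GB
```

Under these assumptions the 7B figure lands on the README's 17GB, and the cache does grow for 65B, since it scales with n_layers * n_heads * head_dim. Lowering max_batch_size or max_seq_len shrinks it proportionally.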

That should fit on 4 Nvidia A40s. Did you get memory errors, or have you not tried yet?

turmeric_root 1244 days ago

So since making that comment I managed to get 65B running on 1 x A100 80GB using 8-bit quantization. Though I did need ~130GB of regular RAM on top of it.
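A back-of-the-envelope sketch of why 8-bit makes the difference here. It counts weights only and uses the nominal 65B parameter count; activations, the decoding cache and any quantization overhead are ignored, so treat it as a lower bound rather than a measurement.

```python
def weight_gb(params_billions, bits_per_param):
    """Approximate weight storage in GB (weights only, nominal parameter count)."""
    return params_billions * bits_per_param / 8


GPU_GB = 80  # one A100 80GB, as in the comment above

for bits in (16, 8):
    need = weight_gb(65, bits)
    verdict = "fits" if need <= GPU_GB else "does not fit"
    print(f"{bits}-bit: {need:.0f} GB of weights, {verdict} in {GPU_GB} GB")
```

The 16-bit figure also matches the ~130GB of regular RAM mentioned above, which would be consistent with loading the full-precision weights on the host before quantizing, though that is a guess.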

UltimateEdge 1244 days ago

So is the model any good?

turmeric_root 1243 days ago

It seems to be about as good as gpt3-davinci. I've had it generate React components and write crappy poetry about arbitrary topics. Though as expected, it's not very good at instructional prompts since it's not tuned for instruction.

People are also working on adding extra samplers to FB's inference code; I think a repetition penalty sampler will significantly improve quality.
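For context, a minimal sketch of what a repetition penalty usually does, in the style popularized by the CTRL paper: logits of tokens that were already generated get pushed down before sampling. This is not FB's code or any particular patch; the function name and the plain-list representation are made up for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely.

    Hypothetical helper: `logits` is a list of floats indexed by token id,
    `generated_ids` is the list of token ids produced so far.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits toward zero
        else:
            out[tok] *= penalty   # push negative logits further down
    return out


# Tokens 0 and 1 were already generated; token 2 was not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the logits unchanged; values above 1.0 discourage repeats more strongly.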

The 7B model is also fun to play with. I've had it generate YouTube transcriptions for fictional videos and it's generally on-topic.
