HN Mirror



	by brucethemoose2 981 days ago
	Quantizing it down to 8 bits seems to be one solution. TensorRT-LLM does this (though I think it requires an H100). exLlama also does this on much lesser hardware.
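
	For anyone unsure what "quantizing down to 8 bits" means in practice, here is a minimal sketch. It assumes the thing being quantized is a plain list of floats and uses simple symmetric per-tensor scaling. The function names are made up for illustration, and this is not how TensorRT-LLM or exLlama actually implement it.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the 8-bit integers and the shared scale."""
    return [q * scale for q in quantized]


values = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(values)
restored = dequantize_int8(q, scale)

# Each value now needs 1 byte instead of 2 (fp16), plus one scale per tensor.
bytes_fp16 = 2 * len(values)
bytes_int8 = 1 * len(values)
```

	The round trip is lossy: each restored value can be off by up to about half of `scale`, which is the price paid for halving the memory relative to 16-bit storage.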


mlstudies 981 days ago

wouldn't that mean trying to fit it on one machine?


brucethemoose2 981 days ago

Indeed :P

Honestly I'm not sure how context "sharding" works across multiple GPUs atm. Decent, really long-context OSS models like Yi 200K and the YaRN finetunes are very new.
