HN Mirror



	by oblio 338 days ago
	> And you should be able to get two and load half your model into each. It should be about the same speed as if a single card had 32GB.

	This seems super duper expensive and not really supported by the more reasonably priced Nvidia cards, though. SLI is deprecated, NVLink isn't available everywhere, etc.


Dylan16807 338 days ago

No, no, nothing like that.

Every layer of an LLM runs separately and sequentially, and there isn't much data transfer between layers. If you wanted to, you could put each layer on a separate GPU with no real penalty. A single request will only run on one GPU at a time, so it won't go faster than a single GPU with a big RAM upgrade, but it won't go slower either.
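
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the layer count, hidden size, fp16 storage, and the `12 * HIDDEN**2` parameters-per-layer estimate are made-up but plausible values, and `split_layers`, `handoff_bytes_per_token`, and `weight_bytes_per_gpu` are hypothetical helper names, not part of any library. It only compares how much weight data sits on each GPU against how much activation data has to cross from one GPU to the next per token.

```python
# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS = 32
HIDDEN = 4096
BYTES_PER_VALUE = 2  # assuming fp16 weights and activations
# Rough estimate for one transformer block (attention + MLP); an assumption.
PARAMS_PER_LAYER = 12 * HIDDEN * HIDDEN


def split_layers(n_layers, n_gpus):
    """Assign contiguous runs of layers to GPUs, as evenly as possible."""
    base, extra = divmod(n_layers, n_gpus)
    assignment = []
    start = 0
    for gpu in range(n_gpus):
        count = base + (1 if gpu < extra else 0)
        assignment.append(range(start, start + count))
        start += count
    return assignment


def handoff_bytes_per_token(n_gpus):
    """Activation bytes crossing GPU boundaries for one token.

    Only one hidden-state vector is passed at each boundary between GPUs.
    """
    return (n_gpus - 1) * HIDDEN * BYTES_PER_VALUE


def weight_bytes_per_gpu(assignment):
    """Weight bytes resident on each GPU under the given layer assignment."""
    return [len(layers) * PARAMS_PER_LAYER * BYTES_PER_VALUE for layers in assignment]


if __name__ == "__main__":
    plan = split_layers(N_LAYERS, 2)
    for gpu, (layers, nbytes) in enumerate(zip(plan, weight_bytes_per_gpu(plan))):
        print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}, "
              f"{nbytes / 2**30:.1f} GiB of weights")
    print(f"Per-token handoff between GPUs: {handoff_bytes_per_token(2)} bytes")
```

Under these assumed sizes each GPU holds gigabytes of weights while only a few kilobytes move between cards per token, which is why a plain PCIe link is enough and NVLink or SLI isn't needed for this kind of split. Prompt processing moves one such vector per prompt token, so the handoff is larger there but still small next to the weights.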


oblio 338 days ago

Interesting, thank you for the feedback, it's definitely worth looking into!
