

	by boroboro4 660 days ago
	There are different ways to run LLMs on multiple GPUs. One of them, tensor parallelism, effectively multiplies the available memory bandwidth across the GPUs in low-batch scenarios. So no, 8 4090s don't give you just 1000 GB/s.
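
To put rough numbers on the tensor-parallelism point, here is a minimal sketch of the usual memory-bandwidth bound for batch-1 decoding. It assumes every generated token requires reading all weights once, that the weights are split evenly across GPUs, and that communication costs nothing. The function name, the 140 GB model size, and the round 1000 GB/s per-GPU figure are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope bound for batch-1 decoding, where each generated token
# requires reading all model weights from GPU memory once.
# All figures below are illustrative assumptions, not measurements.

def max_tokens_per_second(weight_bytes: float,
                          per_gpu_bandwidth: float,
                          num_gpus: int = 1) -> float:
    """Upper bound on decode speed with weights sharded evenly across GPUs.

    Ignores inter-GPU communication, kernel overhead, and KV-cache reads.
    """
    shard_bytes = weight_bytes / num_gpus                 # each GPU reads only its shard
    seconds_per_token = shard_bytes / per_gpu_bandwidth   # shards are read in parallel
    return 1.0 / seconds_per_token


WEIGHT_BYTES = 140e9     # assumed: 70B parameters at 2 bytes each
GPU_BANDWIDTH = 1000e9   # assumed: ~1000 GB/s per GPU, the figure from the thread

# The single-GPU case is hypothetical (such a model would not fit in 24 GB);
# it is only here as a baseline for the comparison.
single = max_tokens_per_second(WEIGHT_BYTES, GPU_BANDWIDTH, num_gpus=1)
sharded = max_tokens_per_second(WEIGHT_BYTES, GPU_BANDWIDTH, num_gpus=8)

print(f"1 GPU : {single:.1f} tokens/s upper bound")
print(f"8 GPUs: {sharded:.1f} tokens/s upper bound")
```

Under these assumptions the bound scales linearly with the number of GPUs, because each GPU streams only its own shard of the weights. Real systems land below the bound once communication and overhead are included.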

2 comments

behnamoh 660 days ago

you've heard something and are regurgitating it without fully understanding it.


boroboro4 652 days ago

I’m developing an inference engine, so I actually do understand how it works, as well as the other types of parallelism and exactly which trade-offs each of them makes.


klohto 660 days ago

let me know how the PCIe bandwidth is treating you


boroboro4 652 days ago

Since we’re talking about small batch sizes, PCIe bandwidth isn’t as important: the intermediate hidden state that has to cross PCIe is orders of magnitude smaller than the weights.
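
A rough sketch of that size comparison, for anyone who wants to plug in their own model. It assumes Megatron-style tensor parallelism with two all-reduces of the hidden state per transformer layer, and it ignores the extra factor introduced by the all-reduce algorithm itself as well as per-transfer latency. The model shape (hidden size 8192, 80 layers, 140 GB of fp16 weights) and the function name are hypothetical, chosen only for illustration.

```python
# Compare bytes exchanged between GPUs per generated token with bytes of
# weights each GPU reads from its own memory per generated token.
# Model shape and layout are illustrative assumptions.

def allreduce_payload_bytes_per_token(hidden_size: int,
                                      num_layers: int,
                                      bytes_per_value: int = 2,
                                      allreduces_per_layer: int = 2,
                                      batch_size: int = 1) -> int:
    """Hidden-state bytes all-reduced per decoding step.

    Assumes one hidden-state vector per sequence in the batch is reduced
    `allreduces_per_layer` times in every layer.
    """
    per_allreduce = batch_size * hidden_size * bytes_per_value
    return per_allreduce * allreduces_per_layer * num_layers


HIDDEN_SIZE = 8192     # assumed hidden dimension
NUM_LAYERS = 80        # assumed number of transformer layers
WEIGHT_BYTES = 140e9   # assumed: 70B parameters at 2 bytes each
NUM_GPUS = 8

payload = allreduce_payload_bytes_per_token(HIDDEN_SIZE, NUM_LAYERS)
weights_per_gpu = WEIGHT_BYTES / NUM_GPUS
ratio = weights_per_gpu / payload

print(f"hidden state exchanged per token: {payload / 1e6:.2f} MB")
print(f"weights read per GPU per token:   {weights_per_gpu / 1e9:.2f} GB")
print(f"ratio: about {ratio:.0f}x")
```

With these assumed numbers, each token moves roughly 2.6 MB of hidden state between GPUs while each GPU reads 17.5 GB of weights locally. The payload grows linearly with batch size, which is why the argument is specific to small batches.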
