| HN Mirror

by suprjami 61 days ago

You don't.

With multiple cards in normal PCI Express slots, the LLM's layers are split across the cards.

When you run inference, it runs on one card then the other card. You can repeat this for as many cards as you want.
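To make the layer split concrete, here is a toy sketch in plain Python. It is not how any particular inference engine does it: `make_layer`, `split_layers` and `run_inference` are hypothetical names, each "card" is just a list of layers, and a "transfer" is only a counter. The point is the shape of the loop: one card finishes its layers, hands the activations over, and the next card continues.

```python
# Toy layer-split inference. No real GPUs are involved: each "card" is a
# list of layers, and a transfer is counted when activations move on.

def make_layer(scale):
    # Hypothetical stand-in for a real model layer.
    return lambda acts: [a * scale + 1.0 for a in acts]

def split_layers(layers, num_cards):
    # Contiguous chunks: card 0 gets the first layers, card 1 the next, ...
    per_card = -(-len(layers) // num_cards)  # ceiling division
    return [layers[i:i + per_card] for i in range(0, len(layers), per_card)]

def run_inference(cards, acts):
    transfers = 0
    for idx, card_layers in enumerate(cards):
        if idx > 0:
            transfers += 1  # only the activations cross the bus here
        for layer in card_layers:
            acts = layer(acts)
    return acts, transfers

layers = [make_layer(1.0) for _ in range(8)]
cards = split_layers(layers, 2)
out, transfers = run_inference(cards, [0.0, 0.0])
print([len(c) for c in cards], out, transfers)
```

With 8 layers on 2 cards there is a single hand-off per forward pass, and the weights never move.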

You only copy the activations between the cards, which is ~10 MB/sec at runtime, so PCIe width or generation is irrelevant. Even PCIe 1.0 x1 (roughly 250 MB/s) would be sufficient.
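A rough back-of-envelope check of that, as a sketch. The hidden size, fp16 activations and tokens per second below are my own illustrative assumptions, not measurements, and `activation_bandwidth_mb_s` is a made-up helper. The real figure depends on the model, the batch size and how fast you generate, but it stays far below the bus limit.

```python
# Estimate the traffic across one card boundary during token generation.
# All inputs are assumed example values, not measured numbers.

def activation_bandwidth_mb_s(hidden_size, bytes_per_value, tokens_per_sec, boundaries=1):
    # One activation vector of size hidden_size crosses each boundary per token.
    bytes_per_token = hidden_size * bytes_per_value
    return bytes_per_token * tokens_per_sec * boundaries / 1e6

PCIE_1_0_X1_MB_S = 250.0  # approximate one-direction throughput of PCIe 1.0 x1

# Assumed: hidden size 8192, fp16 (2 bytes per value), 50 tokens/sec, 2 cards.
need = activation_bandwidth_mb_s(8192, 2, 50)
print(f"{need:.2f} MB/s needed, {PCIE_1_0_X1_MB_S:.0f} MB/s available")
```

Prompt processing pushes many tokens through at once, so it moves more data than this, but there is still a lot of headroom.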

There are other software optimisations (row split, tensor parallel) which require fast interlinks like NVLink but you can get a long way without any of that.