by yorwba
105 days ago
There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
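
A toy sketch of that token-level dependency, for illustration only. `next_token` is a made-up stand-in for a full forward pass (not any real model or library API), and its arithmetic is arbitrary; the point is just that step t needs the output of step t-1 before it can begin.

```python
def next_token(prefix):
    # Hypothetical stand-in for a forward pass: the result depends on
    # every token generated so far.
    return (sum(prefix) * 31 + len(prefix)) % 1000


def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # This iteration cannot start until the previous one has appended
        # its token, so extra devices cannot shorten the chain of steps.
        tokens.append(next_token(tokens))
    return tokens
```

More hardware can make each call to `next_token` faster, but the loop itself still runs one step at a time.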
(Confirmation is faster than prediction.)
Many model architectures are specifically designed to make this efficient.
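
A rough sketch of why checking can be cheaper than generating, under toy assumptions: `next_token` is a hypothetical stand-in for a forward pass, and a thread pool stands in for the batched computation a real accelerator would do. When a draft continuation is already available, every prefix is known up front, so all positions can be checked independently.

```python
from concurrent.futures import ThreadPoolExecutor


def next_token(prefix):
    # Hypothetical stand-in for a forward pass over the given prefix.
    return (sum(prefix) * 31 + len(prefix)) % 1000


def verify(prompt, draft):
    """Return how many leading draft tokens match what the model would emit."""
    seq = list(prompt) + list(draft)
    positions = range(len(prompt), len(seq))
    # Each check only reads an already-known prefix, so the checks do not
    # depend on each other and can run concurrently.
    with ThreadPoolExecutor() as pool:
        expected = list(pool.map(lambda i: next_token(seq[:i]), positions))
    accepted = 0
    for got, want in zip(draft, expected):
        if got != want:
            break
        accepted += 1
    return accepted
```

For example, `verify([1, 2, 3], [189, 49])` accepts both draft tokens, while a draft that goes wrong at the second position is accepted only up to the first.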
---
Separately, your statement is only true when comparing the same generation of hardware, the same interconnects, and the same quantization.