by erichocean
105 days ago
Partially true: you can predict multiple tokens and then confirm them, which typically gives a 2-3x speedup in practice. (Confirmation is faster than prediction.) Many model architectures are specifically designed to make this efficient.

Separately, your statement is only true for the same generation of hardware, interconnects, and quantization.
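
To make the predict-then-confirm idea concrete, here is a minimal sketch of the greedy variant of speculative decoding. Everything in it is an assumption for illustration: `target_model` and `draft_model` are hypothetical toy functions standing in for a large and a small model, and the verification step is written as a loop even though a real system would likely score all candidate positions in one batched forward pass (which is why confirmation can be cheaper than prediction). Production implementations usually verify against probabilities with rejection sampling rather than exact token matches.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large, slow model (toy deterministic rule).
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small, fast model: it agrees with the
    # target except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model scores all k + 1 positions. This is counted as
        #    a single pass because a real model could do it in one batch.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix the target agrees with, then append the
        #    target's own token at the first disagreement (or a bonus token
        #    if every proposal was accepted).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens.extend(proposal[:n_ok])
        tokens.append(verified[n_ok])
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_model, prompt, 20)
    fast, passes = speculative_decode(target_model, draft_model, prompt, 20)
    print("same output:", fast == baseline)
    print("target passes:", passes, "vs", 20, "for plain decoding")
```

The output matches plain greedy decoding of the target model exactly; only the number of target passes changes. How much that helps in practice depends on how often the draft agrees with the target and on how cheap the draft is, neither of which this toy measures.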