by czl
53 days ago
> "--speculative-config",

Regarding that last option:
Speculation helps max concurrency when it replaces many memory-expensive serial decode rounds with fewer verifier rounds and the proposer is cheap enough. It hurts when you are already compute-saturated or the acceptance rate is too low. It's a good idea to benchmark your workload with and without speculative decoding.
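
To put rough numbers on that trade-off before benchmarking, here is a back-of-envelope sketch. It is a toy model, not anything taken from an inference engine: it assumes each draft token is accepted independently with probability `alpha`, that the proposer costs `draft_cost` target-model steps per drafted token, and that verifying each extra token adds `verify_overhead` target-model steps (close to 0 when decode is memory-bound, approaching 1 when you are compute-saturated). All function and parameter names are made up for illustration.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier round with k drafted tokens.

    Assumes each drafted token is accepted independently with
    probability alpha; the verifier always contributes one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_overhead: float = 0.0) -> float:
    """Estimated speedup over plain decoding (1.0 means break-even).

    Costs are in units of one ordinary target-model decode step.
    verify_overhead is a hypothetical knob for how much each extra
    verified token costs: ~0 if memory-bound, ~1 if compute-saturated.
    """
    round_cost = k * draft_cost + 1.0 + k * verify_overhead
    return expected_tokens_per_round(alpha, k) / round_cost


if __name__ == "__main__":
    # Cheap proposer, good acceptance, memory-bound verifier.
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05))
    # Same settings, but the verifier is compute-saturated.
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05, verify_overhead=1.0))
    # Low acceptance rate with a pricier proposer.
    print(estimated_speedup(alpha=0.2, k=4, draft_cost=0.3))
```

Values above 1.0 suggest speculation should pay off under these assumptions, and values below 1.0 suggest it costs more than it saves. A real benchmark on your own workload is still the thing to trust.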