by borzunov
1182 days ago
A Petals dev here. We say up front that "Single-batch inference runs at ≈ 1 sec per step (token)". In turn, "parallel inference" refers to the high-throughput scenario where you generate multiple sequences in parallel. This is useful when you process a large dataset with an LLM (e.g. run inference with a batch size of 200) or run a beam search with a large beam width. In this case, you can actually reach hundreds of tokens per second; see our benchmarks for parallel forward passes: https://github.com/bigscience-workshop/petals#benchmarks If you have another wording in mind that is more up front, please let us know. We'd be happy to improve the project description. Petals is a non-commercial research project, and we don't want to oversell anything.
Does each node earn points for supplying resources, which can then be spent on greater query/processing speed?