Hacker News new | ask | show | jobs
by borzunov 1182 days ago
A Petals dev here. We say up front that "Single-batch inference runs at ≈ 1 sec per step (token)".

In turn, "parallel inference" refers to the high-throughput scenario when you generate multiple sequences in parallel. This is useful when you process some large dataset with LLM (e.g. run inference with batch size of 200) or run a beam search with a large beam width. In this case, you can actually get the speed of hundreds of tokens per sec, see our benchmarks for parallel forward passes: https://github.com/bigscience-workshop/petals#benchmarks

If you have another wording in mind that is more up front, please let us know, we'd be happy to improve the project description. Petals is a non-commercial research project, and we don't want to oversell anything.

1 comments

Can it run in a docker-compose container with a set ressource limit?

Do each node earn points for supplying resources that can then be spend for greater query / process speed?

Sure! The point system is being developed, we'll ship it soon. Once it's ready, you'll be able to spend points on high-priority requests to increase the speed.