by borzunov
1210 days ago
Note that the authors report the speed of generating many sequences in parallel (per token):

> The batch size is tuned to a value that maximizes the generation throughput for each system.

> FlexGen cannot achieve its best throughput in [...] single-batch case.

For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get good performance _per token_.

However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so that a generation step takes <= 1 sec/token). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically best RAM offloading setup can't run single-batch generation faster than 5.5 sec/token due to hardware constraints.

The authors don't report the speed of single-batch generation in either the repo or the paper.

[1] https://arxiv.org/pdf/2209.01188.pdf
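To make the throughput vs. latency distinction above concrete, here is a rough sketch in Python. It is not taken from FlexGen or from [1]; the function names and batch sizes are hypothetical, and it assumes the per-step latency stays constant as the batch grows, which is a simplification. The 5.5 sec/token figure is the single-batch floor quoted above.

```python
def aggregate_throughput(batch_size: int, step_latency_s: float) -> float:
    """Tokens per second summed over all sequences in the batch."""
    return batch_size / step_latency_s


def per_sequence_latency(step_latency_s: float) -> float:
    """Seconds a single sequence waits for each of its new tokens.

    Every sequence in the batch advances by one token per generation step,
    so batching does not shorten this wait.
    """
    return step_latency_s


# Single-batch floor quoted in the comment; assumed (as a simplification)
# to stay the same for larger batches.
STEP_LATENCY_S = 5.5

for batch_size in (1, 8, 64):  # hypothetical batch sizes
    tput = aggregate_throughput(batch_size, STEP_LATENCY_S)
    wait = per_sequence_latency(STEP_LATENCY_S)
    print(f"batch={batch_size:>3}  throughput={tput:6.2f} tok/s  "
          f"per-sequence latency={wait} s/token")
```

Under these assumptions the aggregate tokens/sec number grows with the batch size, while each individual sequence still waits the full step time for its next token, which is the number that matters for an interactive app.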
These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped-down library for autoregressive sampling. HN falling for garbage once again...