While I agree that throughput-focused scenarios exist and this work may be valuable for them, I still think that the repository can be improved to avoid "overselling".

That FlexGen's single-batch generation performance is much worse is not obvious to most people unfamiliar with the peculiarities of LLM inference, and it is worth clarifying. Instead, the readme starts by mentioning ChatGPT and Codex, projects that both rely on single-batch inference of LLMs at interactive speeds, which is not really possible with FlexGen's offloading (given the speed mentioned in the parent comment). The batch sizes are not reported in the table either.

Seeing that, I'm not surprised that most HN commenters misunderstood the project's contribution.