| Some projects, such as lmdeploy[0], can also quantize the KV cache[1] to save some VRAM. Speaking of lmdeploy, it doesn't seem to be widely known, but it also supports quantization with AWQ[2], which appears to be superior to the more widely used GPTQ. The serving backend is Nvidia Triton Inference Server. Not only is Triton extremely fast and efficient, but the lmdeploy team also ships a custom TurboMind backend for it. With this setup, lmdeploy delivers the best performance I've seen[3]. On my development workstation (RTX 4090, llama2-chat-13b, AWQ int4, KV cache int8): 8 concurrent sessions (batch 1): 580 tokens/s; 1 concurrent session (batch 1): 105 tokens/s. This is out of the box; I haven't spent any time optimizing it further. [0] - https://github.com/InternLM/lmdeploy [1] - https://github.com/InternLM/lmdeploy/blob/main/docs/en/kv_in... [2] - https://github.com/InternLM/lmdeploy/tree/main#quantization [3] - https://github.com/InternLM/lmdeploy/tree/main#performance |
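To give a rough sense of why an int8 KV cache matters with several concurrent sessions, here is a back-of-the-envelope sketch. It is not lmdeploy code. It assumes the usual Llama 2 13B shape (40 layers, hidden size 5120, no grouped-query attention) and a hypothetical 2048-token context per session, and it ignores the small overhead of the quantization scales and zero points.

```python
# Back-of-the-envelope KV cache sizing; not lmdeploy's implementation.

# Assumed Llama 2 13B shape (not stated in the comment above).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int) -> int:
    """Each layer stores one key and one value vector of hidden_size per token."""
    return 2 * num_layers * hidden_size * bytes_per_value


def cache_gib(bytes_per_token: int, sessions: int, context_tokens: int) -> float:
    """Total cache size in GiB if every session fills its whole context."""
    return bytes_per_token * sessions * context_tokens / 2**30


fp16_per_token = kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, 2)  # 2 bytes per fp16 value
int8_per_token = kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, 1)  # 1 byte per int8 value

# Hypothetical workload: 8 sessions, each with a full 2048-token context.
SESSIONS = 8
CONTEXT_TOKENS = 2048

fp16_total = cache_gib(fp16_per_token, SESSIONS, CONTEXT_TOKENS)
int8_total = cache_gib(int8_per_token, SESSIONS, CONTEXT_TOKENS)

print(f"fp16 KV cache: {fp16_per_token} bytes/token, {fp16_total} GiB total")
print(f"int8 KV cache: {int8_per_token} bytes/token, {int8_total} GiB total")
```

Under those assumptions the fp16 cache alone would need about 12.5 GiB for 8 full sessions, versus about 6.25 GiB at int8. On a 24 GB card that already holds the int4 weights, that difference is a big part of how much concurrency fits.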