
The lazy way is to use text-generation-webui with an exllamav2 (exl2) conversion of your model, tick the 8-bit cache option, and turn down the context length until everything fits. If you go over your VRAM, speed drops substantially: something like 60 tok/s down to 15 tok/s for an extra 500 tokens of context beyond what fits. The same idea applies to other backends: you need all the layers in VRAM if you want decent tok/s. As a starting point, for 7B models on an 8 GB GPU I typically end up at 4k-6k context length with 4-6 bit quantization. So start at 4 bit, 4k context, and adjust up as you can.
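If you'd rather guess a starting point than find it by trial and error, here is a rough back-of-the-envelope sketch in Python. It is not how any backend actually accounts for memory. The function name is made up, the default shape (32 layers, 32 KV heads, head dimension 128) is an assumption matching a Llama-2-7B-style model, and it ignores activations, framework overhead, and whatever else is using the GPU, so treat the result as a lower bound.

```python
def estimate_vram_gib(params_billion, weight_bits, context_len,
                      n_layers=32, n_kv_heads=32, head_dim=128,
                      cache_bits=8):
    """Rough lower bound on VRAM (GiB) for quantized weights plus KV cache.

    Defaults are an assumed Llama-2-7B-style shape; models using
    grouped-query attention have fewer KV heads and a smaller cache.
    """
    # Quantized weights: parameter count times average bits per weight.
    weight_bytes = params_billion * 1e9 * weight_bits / 8
    # KV cache: one key and one value vector per layer, per KV head, per token.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * cache_bits / 8
    cache_bytes = kv_bytes_per_token * context_len
    return (weight_bytes + cache_bytes) / 2**30


if __name__ == "__main__":
    # Compare a few quantization / context combinations for a 7B model.
    for bits, ctx in [(4, 4096), (5, 4096), (6, 6144)]:
        gib = estimate_vram_gib(7, bits, ctx)
        print(f"{bits}-bit weights, {ctx} context: ~{gib:.1f} GiB")
```

Under those assumptions, 4-bit weights with 4k context come out to about 4.3 GiB and 6-bit with 6k context to about 6.4 GiB, which leaves some headroom on an 8 GB card for everything the estimate leaves out.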

You can find most popular models already converted on huggingface.co if you add exl2 to your search; start with the 4-bit quantized version. Don't bother going above 6 bits even if you have spare VRAM; in practice it doesn't offer much benefit.

For reference, I max out around 60 tok/s at 4 bit, 50 tok/s at 5 bit, and 40 tok/s at 6 bit for some random 7B-parameter model on an RTX 2070.