by AlexC04
60 days ago
One other thing you might want to check out for running locally. (I have not independently verified this yet; it's on the TODO list, though.)

https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...

vLLM apparently already has an implementation of TurboQuant available, which is said to losslessly reduce the required memory footprint by 6x and improve inference speed by 8x.

From what I understand, the steps are:

1. launch vLLM
2. execute a vLLM configure command like "use kv-turboquant for model xyz"
3. that's it

I've got two kids under 8 years old, a full-time job, and a developer-tools project that takes like 105% of my mental interest... so it's been a challenge finding the time to swap from ollama to vLLM and find out whether that is true. So buyer beware :D And if anyone tries it, please let me know whether it is worth the time!
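For a rough sense of what the claimed 6x would mean in practice, here is a back-of-the-envelope sketch. It assumes the reduction applies to the KV cache (which the "kv-" in the name suggests, but I have not confirmed it), and the model shape below is hypothetical, not taken from any real model card. It does not touch vLLM at all; it is just the arithmetic.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """KV cache size for one sequence: one key and one value vector
    per KV head, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape (assumption for illustration only).
baseline = kv_cache_bytes(
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    context_tokens=32768,
    bytes_per_value=2.0,  # fp16 / bf16
)

# The 6x figure is the unverified claim from above, not a measurement.
claimed = baseline / 6

print(f"fp16 KV cache:     {baseline / 2**30:.2f} GiB")  # 4.00 GiB
print(f"at the claimed 6x: {claimed / 2**30:.2f} GiB")   # 0.67 GiB
```

So for that made-up shape, a 32k-token context would drop from about 4 GiB to about 0.67 GiB of cache per sequence, if the claim holds. Model weights are a separate cost and are not included here.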