by AlexC04 60 days ago
One other thing you might want to check out for running locally (I haven't independently verified it yet, but it's on the TODO list):

https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...

vLLM apparently already has an implementation of TurboQuant available, which is said to losslessly reduce the required memory footprint by 6x and improve inference speed by 8x.
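
To get a feel for what a 6x reduction would mean in practice, here is a rough back-of-envelope sketch. It assumes the 6x figure applies to the KV cache (as the "kv-" in the command below suggests) rather than to the model weights, and the model dimensions are hypothetical placeholders, not taken from any particular model. It only does the arithmetic and does not touch vLLM at all.

```python
# Back-of-envelope KV-cache sizing. All model dimensions below are
# hypothetical placeholders, not taken from any specific model card.

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Approximate KV-cache size in bytes for a single sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

GIB = 1024 ** 3

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 32k-token context, fp16.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          tokens=32_768, bytes_per_value=2)

# Apply the claimed (and here unverified) 6x reduction.
compressed = baseline / 6

print(f"fp16 KV cache:   {baseline / GIB:.2f} GiB")    # 4.00 GiB
print(f"at a claimed 6x: {compressed / GIB:.2f} GiB")  # 0.67 GiB
```

If the claim holds, the saving would scale with context length and batch size, since those are what make the KV cache grow; the weights themselves would take the same amount of memory as before.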

From what I understand, the steps are:

1. launch vLLM
2. execute a vLLM configure command like "use kv-turboquant for model xyz"
3. that's it

I've got two kids under 8 years old, a full-time job, and a developer-tools project that takes like 105% of my mental bandwidth... so it's been a bit of a challenge finding the time to swap from ollama to vLLM and find out whether that is true.

So, buyer beware :D And if anyone tries it, please let me know whether it was worth the time!