by gcr
60 days ago
On TheTom’s llama.cpp fork, TurboQuant makes inference about five to ten times slower than vanilla llama.cpp (M1 Max, qwen3.6-35b-a3b). Seems like productionization is still a ways off. Excited to see how far we can get it down, though.