| HN Mirror


	by rapatel0 79 days ago
	I got qwen3.6:27B running on my 4090 (24GB) with ~128K context, leveraging some of the recent turboquant/rotorquant memory optimizations for activations. I highly suggest going up to that. The q4_xl+rotorquant combo is pretty good. Here's some reference code if you want to throw your agent at it: https://github.com/rapatel0/rq-models

2 comments

altruios 79 days ago

What is your experience with performance beyond 40k tokens? I haven't gone past that, as I've heard reports that problems start to arise. I'd be happy to hear your experience in that regard.

rapatel0 73 days ago

I'm super happy with the performance. I generally run with 2 parallel slots, so I only get about a 128K context window. My experience with all LLMs is that they get more forgetful if you use the full window. (256-512K is the sweet spot for frontier models; 128K works for me with this current qwen.)
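
For anyone following the slot arithmetic, here is a minimal sketch of how a per-slot window falls out of a shared budget. It assumes the serving stack divides one total context budget evenly across parallel slots, which is an assumption and may not match your server. The 256K total and the `context_per_slot` helper are hypothetical, chosen only so that 2 slots come out at about 128K each.

```python
def context_per_slot(total_context_tokens: int, parallel_slots: int) -> int:
    """Tokens available to each slot, assuming an even split (assumption)."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_context_tokens // parallel_slots


# Hypothetical total budget; not a figure stated in the thread.
TOTAL_CONTEXT = 256 * 1024

for slots in (1, 2, 4):
    print(slots, "slot(s):", context_per_slot(TOTAL_CONTEXT, slots), "tokens each")
```

Under that assumption, adding slots trades per-request context for concurrency, so the window each request sees shrinks as the slot count grows.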

dmichulke 79 days ago

Forgive my ignorance but aren't they already on huggingface?

I assumed turboquant optimizations are already everywhere: in llama-cpp, or in the quantization machinery of unsloth and the like.

rapatel0 73 days ago

I forked it to also add rotorquant. This is a specific optimization that uses Clifford rotors instead of a static, compile-time random permutation to store the activations. It reduces space and parameter count for the storage.
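
To give a rough feel for the rotor idea, here is a toy sketch. It is not taken from the rq-models repository and is not the rotorquant algorithm. It only illustrates the general pattern of rotating a vector with a rotor, quantizing it, and undoing the rotation on read. It assumes 3D vectors, where a unit rotor is 4 numbers (equivalent to a unit quaternion) versus 9 for a rotation matrix. All function names, the int8-style quantizer, and the sample values are hypothetical.

```python
import math


def make_rotor(axis, angle):
    """Unit rotor for a 3D rotation: 1 scalar + 3 bivector components."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2) / norm
    return (math.cos(angle / 2), ax * s, ay * s, az * s)


def reverse(rotor):
    """Reverse of a unit rotor, which is also its inverse."""
    w, x, y, z = rotor
    return (w, -x, -y, -z)


def _product(a, b):
    """Product of two 4-component elements (quaternion form)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def apply_rotor(rotor, v):
    """Rotate a 3D vector with the sandwich product R v R~."""
    p = (0.0, v[0], v[1], v[2])
    _, x, y, z = _product(_product(rotor, p), reverse(rotor))
    return (x, y, z)


def quantize(v, scale):
    """Hypothetical int8-style quantizer: round to the grid, clamp to [-127, 127]."""
    return tuple(max(-127, min(127, round(c / scale))) for c in v)


def dequantize(q, scale):
    return tuple(c * scale for c in q)


# Made-up sample values, for illustration only.
rotor = make_rotor((1.0, 2.0, 3.0), 0.7)
activation = (0.9, -0.05, 0.02)
scale = 1.0 / 127

stored = quantize(apply_rotor(rotor, activation), scale)
restored = apply_rotor(reverse(rotor), dequantize(stored, scale))
print("stored ints:", stored)
print("restored:", restored)
```

The rotation preserves vector length and is exactly invertible, so the only error in `restored` comes from the rounding step.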