HN Mirror



	by oktoberpaard 335 days ago
	It gives weird results for me. I’m running Qwen3-32B at Q4_K_M with a 32K context length and an 8-bit KV cache, fully offloaded to 24GB of VRAM. According to this calculator, that should be impossible by a large margin, yet it works for me. Edit: this might be because I’ve got flash attention enabled in Ollama.
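
For what it's worth, a rough back-of-the-envelope estimate suggests this setup can plausibly fit. The sketch below is not the calculator's method. It assumes Qwen3-32B has about 32.8B parameters, 64 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, that Q4_K_M averages roughly 4.85 bits per weight, and that an 8-bit KV cache costs about 1 byte per element. All of these are approximations, and compute buffers and runtime overhead are ignored.

```python
# Rough VRAM estimate: quantized weights + KV cache.
# Every constant below is an assumption for illustration, not a measured value.

GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Approximate KV cache size in bytes.

    The factor of 2 accounts for storing both a K and a V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed model and quantization figures (approximate).
N_PARAMS = 32.8e9        # assumed parameter count for Qwen3-32B
BITS_PER_WEIGHT = 4.85   # assumed average for Q4_K_M
N_LAYERS = 64            # assumed
N_KV_HEADS = 8           # assumed (GQA: fewer KV heads than query heads)
HEAD_DIM = 128           # assumed
CONTEXT = 32 * 1024      # 32K tokens

weights_gib = weight_bytes(N_PARAMS, BITS_PER_WEIGHT) / GIB
kv_q8_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 1) / GIB
kv_f16_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 2) / GIB

print(f"weights:        {weights_gib:.1f} GiB")
print(f"KV cache 8-bit: {kv_q8_gib:.1f} GiB")
print(f"KV cache fp16:  {kv_f16_gib:.1f} GiB")
print(f"total (8-bit):  {weights_gib + kv_q8_gib:.1f} GiB")
print(f"total (fp16):   {weights_gib + kv_f16_gib:.1f} GiB")
```

Under these assumptions, the total with an 8-bit KV cache lands a little under 24 GiB, while an fp16 cache would push it over. A calculator that assumes an fp16 cache, or that counts all query heads instead of the smaller number of KV heads, would come out much higher, which might explain the discrepancy.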