| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by 0xbadcafebee 110 days ago
	Well mmap can still cause issues if you run short on RAM, and the disk access can cause latency and overall performance issues. It's better than nothing though. Agree that k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Then the Ollama default quantization for k/v cache is fp16, you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)