
	by magic_hamster 106 days ago
	Ollama is the worst engine you could use for this. Since you are already running on an Nvidia stack for the dense model, you should serve this with vLLM. With 128GB you could try the original safetensors weights, though you would need to be careful with KV cache size and context length.
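	As a rough illustration of why cache size and context length matter with 128GB, here is a back-of-the-envelope KV cache estimate. It is only a sketch: it assumes a standard decoder-only transformer with one key and one value tensor per layer and a 16-bit cache, and the model shape used below is hypothetical, not taken from any model in this thread.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    Assumes one key tensor and one value tensor per layer, each of shape
    (batch_size, n_kv_heads, context_len, head_dim).
    """
    # Factor 2 accounts for keys plus values.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Hypothetical model shape, for illustration only.
example = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                         context_len=32768)
print(f"{example / GIB:.1f} GiB")  # prints "8.0 GiB"
```

	The estimate grows linearly with context length and with the number of concurrent sequences, and it comes on top of the memory used by the weights themselves.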


fortyseven 106 days ago

Strangely, I haven't had a lot of luck with vLLM; I finally ended up ditching Ollama and going straight to the tap with llama-server in llama.cpp. No regrets.


magic_hamster 103 days ago

Good job. llama.cpp is already much better.
