|
|
|
|
|
by magic_hamster
59 days ago
|
|
Ollama is the worst engine you could use for this. Since you are already running on an Nvidia stack for the dense model, you should serve this with vLLM. With 128GB you could try for the original safetensors even though you might need to be careful with caches and context length. |
|