HN Mirror



	by EagnaIonat 257 days ago
	Tried out the Ollama version and it's insanely fast, with really good results for its 1.9GB size. It's supposed to have a 1M context window; I'd be interested to see where the speed goes at that length. No Mamba in the Ollama version though.

2 comments

mehdibl 257 days ago

Ollama usually defaults to Q4 quantization and an 8k/16k context, not the full 1M context.
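
For anyone who wants to test a longer context than the default, here is a minimal sketch of requesting a bigger window per call through Ollama's local HTTP API, using the `num_ctx` option. The model tag `granite4` and the 32768 value are placeholders I picked for illustration, not something from this thread. It assumes an Ollama server on the default port, and memory use will grow with the context size.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build the JSON body for /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window, in tokens
    }


def generate(model: str, prompt: str, num_ctx: int) -> str:
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # "granite4" is a hypothetical model tag; use whatever `ollama list` shows.
    print(generate("granite4", "Summarize the Mamba architecture in one line.", 32768))
```

Timing the same prompt at a few different `num_ctx` values would be one way to see how the speed holds up as the window grows.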


Flere-Imsaho 257 days ago

(I've only just started running local LLMs, so excuse the dumb question.)

Would Granite run with llama.cpp and use Mamba?


RossBencina 257 days ago

Last I checked Ollama inference is based on llama.cpp so either Ollama has not caught up yet, or the answer is no.

EDIT: Looks like Granite 4 hybrid architecture support was added to llama.cpp back in May: https://github.com/ggml-org/llama.cpp/pull/13550


magicalhippo 257 days ago

> Last I checked Ollama inference is based on llama.cpp

Yes and no. They've written their own "engine" using GGML libraries directly, but fall back to llama.cpp for models the new engine doesn't yet support.
