Tried out the Ollama version and it's insanely fast with really good results for 1.9GB size. Supposed to have a 1M context window, would be interested where the speed goes then.
> Last I checked Ollama inference is based on llama.cpp
Yes and no. They've written their own "engine" using GGML libraries directly, but fall back to llama.cpp for models the new engine doesn't yet support.