|
|
|
|
|
by bluecoconut
311 days ago
|
|
I was able to get gpt-oss:20b wired up to claude code locally via a thin proxy and ollama. It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes. (This editing a single server.py file that was ~1000 lines, the tool definitions + claude context was around 30k tokens input, and then after the file read, input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help) |
|
Not sure about ollama, but llama-server does have a transparent kv cache.
You can run it with
Web UI at http://localhost:8080 (also OpenAI compatible API)