by regularfry
150 days ago
It's in the ollama library at q4_K_M, which doesn't quite fit on my 4090 with the default context length, but it only offloads 8 layers to the CPU for me and I'm getting usable enough token rates. That's probably the easiest way to get it. I haven't tried it with vLLM, but if it proves good enough to stick with then I might give it a try.

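If you want to see how much of a loaded model actually ended up on the GPU, here is a rough sketch. It assumes a local ollama server on the default port and relies on the `size` and `size_vram` byte counts that the `/api/ps` endpoint reports for running models; the helper names `gpu_share` and `loaded_models` are made up for illustration.

```python
import json
import urllib.request


def gpu_share(model_entry):
    """Fraction of a loaded model's bytes resident in VRAM (0.0 to 1.0).

    `model_entry` is one item from the "models" list returned by /api/ps.
    """
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the currently loaded models (assumes the default local address)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

With a model loaded, something like `for m in loaded_models(): print(m["name"], f"{gpu_share(m):.0%}")` shows the split. Anything under 100% means part of the model is running on the CPU.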
I'm thinking of giving it a go with aider, but using something like gemma3:27b as the architect. I don't think you can have different models for different skills in opencode, but with smaller local models I suspect splitting the roles like that is unavoidable for now.
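
For the aider side, a minimal sketch of what that invocation might look like, built in Python so the pieces are easy to swap. It assumes aider's `--architect`, `--model` and `--editor-model` options and the `ollama_chat/` model prefix; `your-coder-model:tag` is a placeholder for whichever local model does the editing, and `aider_architect_cmd` is a hypothetical helper name.

```python
import shlex


def aider_architect_cmd(architect_model, editor_model):
    """Build an aider command line with separate architect and editor models.

    Both arguments are ollama model tags; the ollama_chat/ prefix is added here.
    """
    return [
        "aider",
        "--architect",
        "--model", f"ollama_chat/{architect_model}",
        "--editor-model", f"ollama_chat/{editor_model}",
    ]


# "your-coder-model:tag" is a placeholder, not a real model name.
cmd = aider_architect_cmd("gemma3:27b", "your-coder-model:tag")
print(shlex.join(cmd))
```

Aider also needs to know where ollama is listening, which as far as I know is done with the `OLLAMA_API_BASE` environment variable. Both models have to be loaded at some point, so expect swapping on a single 24 GB card.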