by markussss
9 hours ago
My system is quite similar to yours: my GPU is a 6950 XT and my CPU is a Ryzen 5 2600X, with the same amount of RAM, and I feel your pain. It sounds very similar to my experience from a few months ago.

When it comes to tool calling, there are multiple possible issues: some models have borked templates bundled with the model file, some models are not trained on tool calling, some agent harnesses don't support the tool call output from the model very well, and some quantizations ruin the model's ability to call tools.

My suggestions, if you want to experiment further with local models, are to use llama.cpp instead of ollama [1], learn a little about the parameters that tune how much VRAM is used [2], look online for jinja template fixes for the model you're testing [3], and choose a model that was designed for the task you want to achieve, at as high a quantization as you can fit. The maximum model size you can run is VRAM + RAM, although you want as little of the model as possible to be in system RAM.

I'm running North Mini Code IQ3_XXS with some tuned parameters to fit my current tasks, and while it is not perfect for everything, it has not failed any tool calls I've asked it to make, or that it figured it should make on its own.

[1]: https://sleepingrobots.com/dreams/stop-using-ollama/

[2]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

[3]: https://gist.github.com/jscott3201/e4b155885cc68c038d6ac8909...
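To make the "VRAM + RAM" rule a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It is not anything llama.cpp ships: `plan_offload` is a hypothetical helper, the equal-sized-layers assumption is a simplification, and the 1.5 GB reserved for KV cache and compute buffers is a guess you would need to adjust for your context size. The result is only a starting point for llama.cpp's `--n-gpu-layers` setting.

```python
def plan_offload(model_gb, n_layers, vram_gb, ram_gb, overhead_gb=1.5):
    """Estimate how many layers fit in VRAM (hypothetical helper).

    Assumes every layer is the same size and that `overhead_gb` of VRAM
    is reserved for KV cache and buffers (both are rough assumptions).
    Returns None if the model does not fit in VRAM + RAM at all.
    """
    if model_gb > vram_gb + ram_gb:
        return None  # larger than VRAM + RAM combined: cannot be loaded

    per_layer_gb = model_gb / n_layers
    usable_vram_gb = max(vram_gb - overhead_gb, 0.0)

    # Put as many whole layers on the GPU as fit; the rest stay in system RAM.
    return min(n_layers, int(usable_vram_gb // per_layer_gb))


# Illustrative numbers only: a 16 GB card, 32 GB of system RAM (assumed).
print(plan_offload(model_gb=12.0, n_layers=48, vram_gb=16.0, ram_gb=32.0))
print(plan_offload(model_gb=30.0, n_layers=60, vram_gb=16.0, ram_gb=32.0))
```

In the second call only part of the model fits on the GPU, which is the case where generation slows down noticeably, so a smaller model or a lower-bit quantization may be the better trade.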
For LLMs, sadly the only model right now that fits the bill is GPT 4.1, and it's standard in my stack because thinking models have unacceptable latency (>= 1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can still make mistakes, and the prompt prose has to be tuned quite a bit.
I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.