| I like work in this area, and this is really helpful, thanks. I actively avoid cloud based LLMs and mainly use 4b - 30a3b param local models. This means I don't really have a good grasp of SOTA LLM performance or accuracy, but I know what to expect when dealing with local models, and where the pain points are. I've only skimmed the post and read the abstract and in some places you make a nod to how simple tweaks can make something 10x faster/slower, but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed. Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing, since agents will often swap prompts and break prompt caching and similar timing improvements. It looks like your work adds a layers and wrappers like guard rails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add. I really appreciate and respect the work you've done, and apologies if you have already addressed this head on, but with so little talk about the impact on timing here, I feel like you're hiding something or overinflating the actual real world improvements here - what are your thoughts? It's also mildly concerning me that nobody else has raised this - am I doing something wrong here, or is everyone else just not actually using local models in real life?! Talk to me about your speed experiences! |
…but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed."
I wonder, if you were to use cloud-based LLMs more often, you might find that accuracy (fidelity?) is indeed more more lacking in your local models.
You can always just throw hardware at your speed problems after all.