| HN Mirror



	by blks 105 days ago
	> "open safari" (safari opens, voice says: "I opened safari") "navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")

	So you’re describing a broken core feature. The application fails the easiest test.


sanchitmonga22 105 days ago

Fair criticism. The action executed on the LLM side but didn't translate to the correct macOS action: the model hallucinated success instead of routing to the open_url tool.

This is a known limitation with small LLMs (0.6B-1.2B) doing tool calling. They sometimes confuse "I know what you want" with "I did it." Upgrading to a larger model improves tool-calling accuracy significantly.

We're also working on verification: having the pipeline confirm the action actually succeeded before reporting back. That's a fair expectation and we should meet it.
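
To make the verification idea concrete, here is a rough sketch of what "confirm before reporting back" could look like. It is not the project's actual pipeline: `ToolCall`, `report`, `FakeBrowser`, and the verifier callbacks are all hypothetical names invented for illustration, and only the `open_url` tool name comes from this thread. The point is that the spoken reply is derived from observed state, never from the model's own claim of success.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ToolCall:
    name: str      # e.g. "open_url" (tool name mentioned in the thread)
    argument: str  # e.g. "https://google.com"


class FakeBrowser:
    """Stand-in for the real browser, so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.current_url: Optional[str] = None

    def open_url(self, url: str) -> None:
        self.current_url = url


def report(
    call: Optional[ToolCall],
    tools: Dict[str, Callable[[str], None]],
    verifiers: Dict[str, Callable[[str], bool]],
) -> str:
    """Return the sentence to speak, based on what was observed."""
    if call is None:
        # The model produced text only; nothing was executed.
        return "I could not do that: no tool was called."
    if call.name not in tools:
        return f"I could not do that: unknown tool {call.name!r}."
    try:
        tools[call.name](call.argument)
    except Exception as exc:
        return f"I could not do that: {call.name} failed ({exc})."
    # Hypothetical post-condition check, e.g. reading back the current URL.
    check = verifiers.get(call.name)
    if check is None or not check(call.argument):
        return f"I tried {call.name}, but could not confirm it worked."
    return f"Done: {call.name} succeeded for {call.argument}."


browser = FakeBrowser()
tools = {"open_url": browser.open_url}
verifiers = {"open_url": lambda url: browser.current_url == url}

# Model answered in prose without emitting a tool call: report failure.
print(report(None, tools, verifiers))
# Model emitted a tool call and the post-condition holds: report success.
print(report(ToolCall("open_url", "https://google.com"), tools, verifiers))
```

With this shape, the "navigate to google.com" case from the parent comment would produce a failure message rather than "I navigated to google.com", because no tool call was ever made. In a real macOS pipeline the verifier would have to query the actual application state, which is the hard part this sketch leaves out.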


elpakal 105 days ago

> This is a known limitation with small LLMs (0.6B-1.2B) doing tool calling.

To me this is the nut to crack, wrt tool calling and locally running inference. This seems like a really cool project and I'm going to dive around a little later, but if it's hallucinating on something as basic as this, it makes me think it's more at the POC stage right now (to echo other sentiment here).


sanchitmonga22 105 days ago

That's a fair read. Tool calling reliability with sub-4B models is genuinely the hardest unsolved problem in on-device AI right now.

The inference engine (MetalRT) is production-grade, the pipeline architecture is solid, but the models at this size are still the weak link for complex tool routing. Larger model support (where tool calling is much more reliable) is next on the roadmap. Please stay tuned!


elpakal 105 days ago

Sorry, I scrolled through some of the rest of the comments on this thread and can’t stay tuned.
