by itissid 6 days ago
Interesting. Making correct tool calls at low latency is pretty important in cascading voice AI pipelines (STT, LLM, TTS). Realtime models are still 2x the cost, and only two providers, OpenAI and Google, are in the race. For cost control it has to be cascading models.

Sadly, the only LLM that fits the bill right now is GPT 4.1, and it's standard in my stack because thinking models have unacceptable latency (>= 1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can still make mistakes, and the prompt prose has to be tuned quite a bit.

I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.