| HN Mirror



	by armcat 110 days ago
	This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI recently introduced WebSockets in their Responses client, so it should be a bit faster. An alternative is to run a very small LLM locally on your device. I built my own fully local pipeline and it achieved sub-second RTT, with neither streaming nor optimisations: https://github.com/acatovic/ova
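	To see where the round-trip time goes in a pipeline like this, it helps to time each stage separately. Below is a minimal sketch, assuming a three-stage STT -> LLM -> TTS chain; `time_pipeline`, `fake_stt`, `fake_llm` and `fake_tts` are hypothetical stand-ins for illustration and are not taken from the linked repository.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def time_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record wall-clock seconds per stage."""
    timings: Dict[str, float] = {}
    start_all = time.perf_counter()
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)  # output of one stage feeds the next
        timings[name] = time.perf_counter() - start
    timings["total"] = time.perf_counter() - start_all
    return payload, timings


# Hypothetical placeholder stages; swap in real model calls to measure them.
def fake_stt(audio: bytes) -> str:
    return "hello"


def fake_llm(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    stages = [("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)]
    _, timings = time_pipeline(stages, b"\x00\x01")
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.3f} ms")
```

	With real stages plugged in, the per-stage numbers show whether the LLM call or the speech stages dominate the round trip.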


nicktikhonov 110 days ago

Very cool! Starred and on my reading list. Would love to chat and share notes, if you'd like.


alfalfasprout 110 days ago

Also consider using Cerebras' inference APIs. They released a voice demo a while back and the latency of their model inference is insane.


ilaksh 109 days ago

I tried Cerebras and it was unbeatable at first, but the client didn't want to pay $1300 a month, and the $50/month and pay-as-you-go options were just not reliable. They would return service-unavailable errors or falsely claim we were over our rate limit.

Groq is also very fast, but the latency wasn't always consistent, and I saw some very strange responses on a few calls that I had to attribute to quantization.
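
For intermittent service-unavailable and rate-limit errors like the ones mentioned above, a common mitigation is to retry with exponential backoff and jitter. This is a minimal sketch under stated assumptions: `TransientError` and `call_with_backoff` are hypothetical names, and the provider's real client library would raise its own exception types.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 503 or rate-limit response."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the current cap.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

In a voice pipeline every retry adds directly to the response time, so the attempt count and delays would likely need to be much smaller than these defaults, or the call could fall back to another provider instead.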


riquito 109 days ago

You may be interested in gemini-2.5-flash-preview-tts.

Text in, audio out, so you can merge LLM+TTS into a single step (streamable).

https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flas...
