| HN Mirror

	by lxe 119 days ago
	I built something similar for Linux (yapyap: push-to-talk with whisper.cpp). The "local is too slow" argument doesn't hold up anymore if you have any GPU at all. whisper large-v3-turbo with CUDA on an RTX card transcribes a full paragraph in under a second. Even on CPU, parakeet is near-instant for short utterances.

	The "deep context" feature is clever, but screenshotting and sending to a cloud LLM feels like massive overkill for fixing name spelling. The accessibility API approach someone mentioned upthread is the right call: grab the focused field's content, nearby labels, window title. That's a tiny text prompt a 3B local model handles in milliseconds. No screenshots, no cloud, no latency.

	The real question with Groq-dependent tools: what happens when the free tier goes away? We've seen this movie before. Building on local models is slower today but doesn't have a rug-pull failure mode.
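	To make the "tiny text prompt" idea concrete, here is a minimal sketch of how the gathered context might be packed into a prompt. It assumes the focused field's text, nearby labels, and window title have already been read through some accessibility API; `FieldContext`, `build_correction_prompt`, and the prompt wording are hypothetical names for illustration, not part of yapyap or any tool mentioned in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class FieldContext:
    """Hypothetical container for text gathered via an accessibility API."""
    window_title: str
    focused_text: str
    nearby_labels: list[str] = field(default_factory=list)


def build_correction_prompt(ctx: FieldContext, transcript: str, max_chars: int = 500) -> str:
    """Assemble a small text-only prompt asking a local model to fix name spelling."""
    labels = ", ".join(ctx.nearby_labels)
    # Keep only the tail of the field: the most recent text is usually the most relevant.
    tail = ctx.focused_text[-max_chars:]
    return (
        "Fix the spelling of names in the transcript using the context.\n"
        f"Window: {ctx.window_title}\n"
        f"Labels: {labels}\n"
        f"Field text: {tail}\n"
        f"Transcript: {transcript}\n"
        "Corrected:"
    )


if __name__ == "__main__":
    ctx = FieldContext("Mail - Reply to Siobhan", "Hi Siobhan,", ["To", "Subject"])
    print(build_correction_prompt(ctx, "thanks shivawn, see you tomorrow"))
```

	The resulting string is a few hundred characters at most, which is the kind of input a small local model could plausibly handle quickly; how the context is actually read from the OS is left out here on purpose.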

5 comments

wolvoleo 118 days ago

Yeah, local works really well. I tried this other tool: https://github.com/KoljaB/RealtimeVoiceChat which allows you to live chat with a (local) LLM. With local whisper and local LLM (8b llama in my case) it works phenomenally and it responds so quickly that it feels like it's interrupting me.

Too bad that tool no longer seems to be developed. Looking for something similar. But it's really nice to see what's possible with local models.

Wowfunhappy 119 days ago

> The "local is too slow" argument doesn't hold up anymore if you have any GPU at all.

By "any GPU" you mean a physical, dedicated GPU card, right?

That's not a small requirement, especially on Macs.

arach 119 days ago

My M1 16GB Mini and M2 16GB Air both deliver insane local transcription performance without eating up much memory. I think the M line + Parakeet is a great combination, and you get privacy for free.

ghrl 118 days ago

Yeah, that model is amazing. It even runs reasonably well on my mid-range Android phone with this quite simple but very useful application, as long as you don't speak for too long, or you pause every once in a while to let it transcribe. I do have handy.computer on my Mac too.

https://news.ycombinator.com/item?id=46640855

I find the model works surprisingly well, and in my opinion it surpasses all other models I've tried. Finally a model that can mostly understand my not-so-perfect English and handle language switching mid-sentence (compare that to Gemini's voice input, which is literally THE WORST, always trying to transcribe in the wrong language and, even when the language is correct, producing the uttermost crap imaginable).

arach 118 days ago

Ack for dictation, but Gemini voice is fun for interactive voice experiments. Honestly blown away by how much Gemini could assist with basically no dev effort -> https://hud.arach.dev/

0x457 114 days ago

On Macs you actually don't need it as long as you have enough RAM.

I run a 120M Parakeet model for STT. Even that tiny model works much better than macOS dictation these days.

grosswait 118 days ago

No. Give it a try; I think you'll be surprised.

wazoox 118 days ago

I've installed murmure on my 2013 Mac, and it works through 1073 words/minute. I don't know about you, but that's plenty faster than me :D

h3lp 118 days ago

FWIW whisper.cpp with the default model works at 6x realtime transcription speed on my four-core ~2.4GHz laptop, and doesn't really stress CPU or memory. This is for batch transcribing podcasts.

The downside is that I couldn't get it to segment by speaker. The consensus seemed to be to use a separate tool.
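
For a sense of what "6x realtime" means for batch jobs, here is a small worked calculation. It assumes the speed factor stays constant over the whole file, which is a simplification; `transcription_seconds` is a hypothetical helper, not part of whisper.cpp.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Estimate processing time for audio transcribed at a given multiple of realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    # At N x realtime, one second of compute covers N seconds of audio.
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    one_hour = 60 * 60
    minutes = transcription_seconds(one_hour) / 60
    print(f"A 60-minute podcast takes about {minutes:.0f} minutes at 6x realtime")
```

Under that assumption, a one-hour podcast takes roughly ten minutes of processing.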

BatteryMountain 118 days ago

I also built one; mine is called whispy. I use it to pump commands to Claude. So far it's a bit hit-and-miss; I'm still tweaking it.

lxe 116 days ago

Yeah, that's exactly what I started to do with mine. It runs local Whisper on a CUDA, on a graphics card. Whisper is actually better than any other model that I've seen, even things like Parakeet. It can do language detection. It automatically removes all the ahs and all the ohms unless I specifically enter them in my speech. I think this whole paragraph is going to take maybe half a second to process and paste without any issues.

(and it did it perfectly without any edits required for me at all.)

rafanocode 103 days ago

I did the same; mine is called hapi. I also added meeting recordings + automations so I can use those voice notes to trigger stuff or repurpose them, or just save them anywhere I want.

nitroedge 118 days ago

Handy has worked wonders for me.
