Transcribro: On-device Accurate Speech-to-text


	Transcribro: On-device Accurate Speech-to-text (github.com)
	165 points by thebiblelover7 699 days ago

8 comments

james2doyle 698 days ago

Looks similar to the new FUTO keyboard: https://voiceinput.futo.org/


iamjackg 698 days ago

I've been using this for a while (the voice input, not their keyboard) and it's so refreshing to be able to just speak and have the output come out as fully formed, well punctuated sentences with proper capitalization.


james2doyle 698 days ago

I agree. No more "speaking punctuation". Just talk as normal and it comes out fully formed


freedomben 698 days ago

I actually don't mind speaking punctuation, in fact it kind of helps. What I really hate is the middle-spot where we are right now, where it tries to place punctuation and sucks badly at it.


infinitezest 698 days ago

In my experience, FUTO is actually pretty good at just knowing the right punctuation to use.


leobg 698 days ago

Anything like that available for iOS?


crazygringo 698 days ago

iOS already has on-device dictation built into the standard keyboard.

Years ago it got sent to the cloud, but as long as you have an iPhone from the past few years it's on-device.


ttla 698 days ago

You're right that it exists, but it's complete crap outside a quiet environment. Try to use it while walking around outside or in any semi-noisy area and it fails horribly (iPhone 13, so YMMV if you have a newer one).

You cannot use an iPhone as a dictation device without reviewing the transcribed text, which IMO defeats the purpose of dictation.

Meanwhile, I've gotten excellent results on the iPhone from a Whisper->LLM pipeline.


crazygringo 697 days ago

I've never found real-time dictation software that doesn't need to be reviewed.

I'm definitely waiting for Apple to upgrade their dictation software to the next generation -- I have my own annoyances with it -- but I haven't found anything else that works way better, in real time, on a phone, that runs in the background (like as part of the keyboard).

You talk about Whisper but that doesn't even work in real time, much less when you have to run it through an LLM.


ttla 695 days ago

What's the real-time requirement for? We may have different use cases, but it's not needed if I don't need to review the results. Speak -> Send, without reviewing the text, is the desired workflow. I.e. so you can compose messages without looking at your phone.

So yes, I'm not sure of alternate real-time solutions, but the non-real-time solution of Whisper is much better for my real-world use case.


brylie 698 days ago

Aiko, mentioned elsewhere, includes a local copy of the OpenAI Whisper model: https://apps.apple.com/app/aiko/id1672085276


b33f 698 days ago

Aiko is a free app for iOS and macOS that also uses Whisper for local speech-to-text.


gala8y 696 days ago

There is also Sayboard (open-source, multiple languages): https://github.com/ElishaAz/Sayboard


kolme 698 days ago

This looks great! I've been wanting to drop the Swipe keyboard ever since I saw sneaky ads on it (like me typing "Google Maps" and getting "Bing Maps" as a "suggestion").


yjftsjthsd-h 698 days ago

But open source, which is a pretty big difference


grandma_tea 698 days ago

FUTO and Transcribro are open source.


Humbly8967 698 days ago

No, FUTO made a new "Source First License"[1] that is not Open Source by the OSI definition.

[1] https://github.com/futo-org/android-keyboard/blob/master/LIC...


observationist 697 days ago

I can get behind people doing their own custom "licenses" that amount to throwing their work into the public domain, but if someone builds their own limited licenses around a thing, I won't touch their product. This FUTO license is garbage. Use a real license and either be open source or not; inventing new personal licenses doesn't do anyone any good.


grandma_tea 698 days ago

Oh, that's lame.


yencabulator 698 days ago

FUTO is not open source.

https://gitlab.futo.org/alex/voiceinput/-/blob/master/LICENS...

> FUTO Source First License 1.0

> You may use or modify the software only for non-commercial purposes


flax 698 days ago

Documentation severely lacking. I wanted to know whether this does streaming or only batch, as well as examples for integrating with Android apps.


soupslurpr 697 days ago

It uses VAD and processes after it detects no speech for 3 seconds, so only batch. Examples for integrating with Android apps? Like apps that can use it? Pretty much any app that uses Android's SpeechRecognizer class if you set Transcribro as the user-selected speech recognizer or if the app uses Transcribro explicitly. For example, Google Maps uses the user-selected speech recognizer when it doesn't detect Google's speech services on the system.
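The batching behavior described above (collect audio, then hand it off once no speech has been detected for 3 seconds) can be sketched roughly as follows. This is an illustrative sketch only, not Transcribro's actual code: `segment_on_silence`, the `is_speech` callback, and the frame length are all assumptions made for the example; only the 3-second timeout comes from the comment.

```python
from typing import Callable, Iterable, List

FRAME_MS = 30               # assumed frame length, not taken from the app
SILENCE_TIMEOUT_MS = 3000   # the 3-second figure mentioned in the comment


def segment_on_silence(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],   # hypothetical VAD decision per frame
    frame_ms: int = FRAME_MS,
    silence_timeout_ms: int = SILENCE_TIMEOUT_MS,
) -> List[List[bytes]]:
    """Group frames into utterances; close one after enough trailing silence."""
    segments: List[List[bytes]] = []
    current: List[bytes] = []
    silence_ms = 0
    heard_speech = False

    for frame in frames:
        if is_speech(frame):
            heard_speech = True
            silence_ms = 0
            current.append(frame)
        elif heard_speech:
            # Keep short pauses inside the utterance, but count them.
            silence_ms += frame_ms
            current.append(frame)
            if silence_ms >= silence_timeout_ms:
                # Timeout reached: drop the trailing silence and emit the batch.
                n_silent = silence_ms // frame_ms
                segments.append(current[: len(current) - n_silent])
                current, silence_ms, heard_speech = [], 0, False
        # Silence before any speech is simply skipped.

    if heard_speech:
        # Input ended mid-utterance: flush what we have, minus trailing silence.
        n_silent = silence_ms // frame_ms
        segments.append(current[: len(current) - n_silent])
    return segments
```

Each returned segment would then be passed to the recognizer as a whole, which is why the text only appears after the pause.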


pants2 698 days ago

Considering it uses Whisper, it's probably not streaming


refulgentis 698 days ago

I did some core work on TTS at Google, at several layers, and I've never quite understood what people mean by streaming vs. not.

In each and every case I'm familiar with, streaming means "send the whole audio thus far to the inference engine, inference it, and send back the transcription"

I have a Flutter library that does the same flow as this (though via ONNX, so I can cover all platforms), and Whisper + Silero is ~identical to the interfaces I used at Google.

If the idea is that streaming is when each audio byte is only sent once to the server, there's still an audio buffer accumulated -- it's just on the server.
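The flow described here, where every partial result comes from re-running inference on all audio received so far, might look like the sketch below. The names `partial_transcripts` and `CountingRecognizer` are hypothetical, and the "recognizer" is a stand-in that just decodes bytes as ASCII so the example runs without a model.

```python
from typing import Callable, Iterable, Iterator


class CountingRecognizer:
    """Stand-in for a real model: 'transcribes' by decoding bytes as ASCII."""

    def __init__(self) -> None:
        self.bytes_processed = 0

    def __call__(self, audio: bytes) -> str:
        # Track how much audio the model has had to process in total.
        self.bytes_processed += len(audio)
        return audio.decode("ascii")


def partial_transcripts(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
) -> Iterator[str]:
    """Yield one partial transcript per chunk by re-running on the full buffer."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # The whole accumulated buffer is sent each time, not just the new chunk.
        yield transcribe(bytes(buffer))


if __name__ == "__main__":
    recognizer = CountingRecognizer()
    for text in partial_transcripts([b"he", b"llo", b" wo", b"rld"], recognizer):
        print(text)
    print("bytes processed:", recognizer.bytes_processed)
```

In this toy run the recognizer processes 26 bytes to transcribe 11 bytes of input, which illustrates why the cost of this approach grows with utterance length.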


opprobium 698 days ago

Streaming for TTS doesn't matter but for speech to text it is more meaningful in interactive cases. In that case the user's speech is arriving in real time and streaming can mean a couple levels of things:

- Overlap compute with the user speaking: Not having to wait until all the speech has been acquired can massively reduce latency at the end of speech and allow a larger model to be used. This doesn't have to be the whole system, for instance an encoder can run in this fashion along audio as it comes in even if the final step of the system then runs in a non-streaming fashion.

- Produce partial results while the user is speaking: This can be just a UI nice to have, but it can also be much deeper, eg, a system can be activating on words or phrases in the input before the user is finished speaking which can dramatically change latency.

- Better segmentation: Whisper + Silero is just using VAD to make segments for Whisper, this is not at all the best you can do if you are actually decoding while you go. Looking at the results as you go allows you to make much better and faster segmentation decisions.
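The first point in the list above (overlapping compute with the user speaking) can be illustrated with a small latency model. This is a toy sketch under stated assumptions: audio arrives in fixed-length chunks, encoding one chunk has a fixed cost, a final decode step runs after the last chunk, and all names and numbers are hypothetical rather than measurements of any real system.

```python
def batch_latency(n_chunks: int, encode_cost_ms: int, decode_cost_ms: int) -> int:
    """Wait after end of speech when all encoding starts only at the end."""
    return n_chunks * encode_cost_ms + decode_cost_ms


def overlapped_latency(
    n_chunks: int,
    chunk_duration_ms: int,
    encode_cost_ms: int,
    decode_cost_ms: int,
) -> int:
    """Wait after end of speech when each chunk is encoded as soon as it arrives."""
    end_of_speech = n_chunks * chunk_duration_ms
    finish = 0
    for i in range(n_chunks):
        arrival = (i + 1) * chunk_duration_ms  # chunk i is complete at this time
        start = max(arrival, finish)           # wait for the encoder to be free
        finish = start + encode_cost_ms
    # Only the leftover encoding (if any) plus the final decode is user-visible.
    return max(finish - end_of_speech, 0) + decode_cost_ms


if __name__ == "__main__":
    # 10 one-second chunks, 200 ms to encode each, 500 ms for the final decode.
    print(batch_latency(10, 200, 500))             # 2500
    print(overlapped_latency(10, 1000, 200, 500))  # 700
```

With these made-up numbers the overlapped version hides all but the last chunk's encoding behind the user's own speech, which is the latency reduction the comment describes.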


refulgentis 698 days ago

The only models that do what you're poking at holistically are 4o (claimed) and the 7B one from that French company. They're also bleeding edge, either unreleased or released and way wilder; e.g., the French one interrupts too much and occasionally screams back in an alien language.

Until these, you'd use echo cancellation to try to allow interruptible dialogue, and that's unsolved: you need a consistently cooperative chipset vendor for that (read: it wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works on iPhones consistently.)

The partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD, on every stack I've seen that is described as streaming.

I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are choosing to do "not the best they can at all".


opprobium 698 days ago

This is exactly what Google ASR does. Give it a try and watch how the results flow back to you, it certainly is not waiting for VAD segment breaking. I should know.

Streaming used to be something people cared about more. VAD is always part of those systems as well, you want to use it to start segments and to hard cut-off, but it is just the starting off point. It's kind of a big gap (to me) that's missing in available models since Whisper came out, partly I think because it does add to the complexity of using the model, and latency has to be tuned/traded-off with quality.


r2_pilot 698 days ago

Thank you for your insight. It confirms some of my suspicions working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is even with AEC, the voice output is triggering the VAD and so it continually thinks it's getting interrupted by a human. My next attempt will be to try to only signal true VAD if there's also sound coming from anywhere but behind, where the speaker is. It's been an interesting challenge so far though.
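The idea in the last two sentences above (only report voice activity when the sound is not coming from the loudspeaker's direction) could be prototyped along these lines. This is a rough sketch, assuming the mic array already provides a direction-of-arrival estimate in degrees; `gated_vad`, the 180-degree speaker bearing, and the 45-degree exclusion half-width are hypothetical choices, not values from any particular hardware.

```python
def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two bearings, in degrees (0..180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def gated_vad(
    vad_active: bool,
    doa_deg: float,                          # direction-of-arrival estimate
    speaker_bearing_deg: float = 180.0,      # assumed: loudspeaker is "behind"
    exclusion_half_width_deg: float = 45.0,  # assumed width of the ignored zone
) -> bool:
    """Report voice activity only if the sound is not from the loudspeaker's side."""
    if not vad_active:
        return False
    return angular_distance(doa_deg, speaker_bearing_deg) > exclusion_half_width_deg


if __name__ == "__main__":
    print(gated_vad(True, 0.0))     # front of the array: treated as a human
    print(gated_vad(True, 170.0))   # near the loudspeaker: suppressed
```

One caveat with this sketch: a single direction estimate cannot represent a person and the loudspeaker sounding at the same time, so real barge-in handling would likely need more than this gate.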


Nimitz14 698 days ago

This is a complete non sequitur lol. FYI whisper is not a streaming model though it can, with some work, be adapted to be one.


iamjackg 698 days ago

I think in practical terms (at least for me):

- streaming == I talk and the text appears as I talk

- batched == I talk, and after I'm done talking some processing happens and the text gets populated


refulgentis 698 days ago

Gotcha, then, it's "not even wrong" in the Pauli sense to say Whisper isn't streaming


opprobium 698 days ago

It is not streaming in the way people normally use this term. It's a fuzzy notion but typically streaming means something encompassing:

- Processing and emitting results on something closer to word by word level

- Allowing partial results while the user is still speaking and mid-segment

- Not relying on an external segmenter to determine the chunking (and therefore also latency) of the output.


flax 698 days ago

"streaming" in this case is like another reply said: transcriptions appear as I talk. Compared to not-streaming in which the service waits for silence, then processes the captured speech, then returns some transcription.

Is your Flutter library available? And does it run locally? I'm looking for a good Flutter streaming (in the sense above) speech recognition library. vosk looks good, but it's lacking some configurability such as selecting audio source.


refulgentis 698 days ago

FONNX, haven't gone out of my way to make it trivial[1], but, it's very good, battle tested on every single platform. (And yes runs locally)

[1] example app shows how to do everything, there's basic doc, but man the amount of nonsense you need to know to pull it all together is just too hard to document without a specific Q. Do feel free to file an issue


yewenjie 698 days ago

Seems like Gboard is incompatible with it. Is there a good enough open source alternative to Gboard in 2024 that has smooth glide-typing and a similar layout?


SparkyMcUnicorn 698 days ago

Any of these should work.

https://github.com/Helium314/HeliBoard

https://github.com/openboard-team/openboard

https://github.com/rkkr/simple-keyboard (guessing, since AOSP Keyboard works and this is a fork)

Not open source: https://www.microsoft.com/en-us/swiftkey

Does not have glide/swipe (reserved for symbols), but I just installed it and am giving it a shot: https://github.com/Julow/Unexpected-Keyboard


Grimblewald 698 days ago

Unexpected keyboard is unexpectedly awesome. Looks a bit dated, but boy does it have some functionality packed into it.


nine_k 698 days ago

My choice is https://github.com/AnySoftKeyboard/AnySoftKeyboard/

It does have glide typing, even though I don't use it.

It rather uses long-tap to access multiple symbols, and can be split or pushed to a corner on devices with a big screen.


smeej 697 days ago

Not sure what I'm doing wrong, but I tried installing it on a GrapheneOS device with Play Services installed and nothing happened. When I pushed the mic button, it changed to look pressed for a second, and went back to normal. Nothing happened when I spoke. Tried holding it down while speaking. Still nothing.

I'm very interested in using this, but I can't even find a way to try to troubleshoot it. I'm not finding usage instructions, never mind any kind of error messages. It just doesn't do anything.

This is especially interesting to me because the screenshot on the repo is from Vanadium, which strongly suggests to me that it's from a GrapheneOS device itself.


soupslurpr 697 days ago

You're correct, I do use GrapheneOS. Hm, do you have the global microphone toggle off? There's an upstream issue that causes SpeechRecognizer implementations to silently fail when the microphone toggle is off. You may have to force-stop Transcribro after turning it on.

https://github.com/soupslurpr/Transcribro/issues/3


smeej 697 days ago

I didn't think I did, but cycling it a couple times and restarting did fix it! Great guess!

The thing I'm tripping over now is just that I keep pressing the button more than once when I'm done speaking because it's not clear that it registered the first time. If it could even just stay "pressed" or something while it processes the text, I think that would make it clearer. Any third state for the button would do I think.

Looking forward to using this! Thanks!


soupslurpr 697 days ago

Good to hear it's working.

Ah, it currently uses the Jetpack Compose toggle button but I do suppose it does actually have three states instead of two. I initially wanted to add a loading circle inside the button but wasn't able to without messing up the padding and such.

Hope you enjoy using Transcribro!
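The three-state button discussed in this exchange can be modeled as a small state machine in which extra presses during processing are ignored. This is a hypothetical sketch of the idea only, not Transcribro's Jetpack Compose code; the state and event names are invented for the example, and automatic stopping on silence is left out.

```python
from enum import Enum, auto


class MicState(Enum):
    IDLE = auto()
    RECORDING = auto()
    PROCESSING = auto()   # the "third state" requested in the thread


class MicEvent(Enum):
    PRESS = auto()
    TRANSCRIPT_READY = auto()


# Any (state, event) pair not listed here leaves the state unchanged.
_TRANSITIONS = {
    (MicState.IDLE, MicEvent.PRESS): MicState.RECORDING,
    (MicState.RECORDING, MicEvent.PRESS): MicState.PROCESSING,
    (MicState.PROCESSING, MicEvent.TRANSCRIPT_READY): MicState.IDLE,
}


def next_state(state: MicState, event: MicEvent) -> MicState:
    """Return the state after an event; unknown combinations are ignored."""
    return _TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = MicState.IDLE
    for event in (MicEvent.PRESS, MicEvent.PRESS, MicEvent.PRESS,
                  MicEvent.TRANSCRIPT_READY):
        state = next_state(state, event)
        print(event.name, "->", state.name)
```

Because a press while `PROCESSING` is a no-op, repeated taps after speaking would not restart recording, and the UI could render the third state however fits the layout.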


lawgimenez 698 days ago

This is cool, I get to read another Jetpack Compose codebase since I am halfway through migrating our app to Jetpack. So this helps a lot.


tmaly 698 days ago

I wish there was something where I could transcribe iPhone voice memos to text.

I would pay for an app that did this.


cee_el123 698 days ago

Google has an app called Live Transcribe on Android, but there's no iPhone version.

This looks like an unaffiliated version: https://apps.apple.com/us/app/live-transcribe/id1471473738


hidelooktropic 698 days ago

The microphone icon on the keyboard does this.


swyx 698 days ago

is there an iPhone version of this? custom keyboard?


crancher 698 days ago

Accrescent hype is comically overdone.


free_bip 698 days ago

I looked in the GitHub issues and there's a closed issue for F-Droid inclusion. The author states that F-Droid "Doesn't meet their requirements" but doesn't elaborate. I wonder what F-Droid is missing that they need so much?


okso 698 days ago

F-Droid only packages open-source software and rebuilds it from source, while installing from Accrescent would move all trust to the developer, even if the license changes to proprietary.

I understand that the author trusts themselves more than F-Droid, but as a user the opposite seems more relevant.


ementally 698 days ago

Reason https://www.privacyguides.org/en/android/#f-droid


ktosobcy 693 days ago

Author also points to https://privsec.dev/posts/android/f-droid-security-issues/

I'm not really sold on the argument... Also, the constant push/hype of GrapheneOS (and the "attitude" of its devs) is mildly annoying...


okso 698 days ago

Link: https://github.com/soupslurpr/Transcribro/issues/9


mijoharas 698 days ago

I only just saw it from this project.

I see the features listed[0] which seems like a reasonable feature set, but nothing unusual afaict.

If there has been a lot of hype can you tell me what people find compelling about it?

[0] https://accrescent.app/
