I'm curious - what made you choose deepgram over just running whisper? I don't have any experience with deepgram but whisper has worked so well in my own tests that I didn't even ever consider there might be API speech recognition-only companies.
There's three fantastic niche players in the speech-to-text market right now that you should check out:
- Deepgram (cheap and dirty, but accuracy quite poor)
- Speechmatics (a bit more pricey, but fantastic accuracy)
- Assembly AI (just announced Series C funding of $50m)
well, deepgram might be the fastest among cloud-dependent APIs, like Speechmatics and Assembly AI mentioned above. -but- it cannot be faster than local or smaller models as you mentioned.
Among local solutions,
Whisper SDK doesn't support streaming, I haven't seen any good workarounds or successfully implemented it.
VOSK, DeepSpeech, Kaldi, et al were good once upon a time...
Picovoice seems to be doing well.
unless i'm misunderstanding `whisper.cpp` seems to support streaming & the repository includes a native example[0] and a WASM example[1] with a demo site[2].
have you tried it?
i mean for fun, it wouldn't hurt for sure and ggerganov is doing amazing stuff. kudos to him.
but whisper is designed to process audio files in 30-second batches if I'm not mistaken. it's been a while since whisper released, lol. These workarounds make the window smaller but it doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model. You can't write or train it from scratch. check out the issues referring to the real-time transcription in the repo.
can you use it? yes
would it perform better than Deepgram? -although it's an API and probably not the best API- I am not sure.
would i use it in my money-generating application? absolutely not.
Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, streaming data instead of doing the tasks sequentially
I wonder if we're at a point where you could build a voice assistant like that, except almost-realtime and streamed end to end:
User speaks and speech to text starts streaming text while the user is still speaking. That text stream is piped into a LLM, which also streams its output text. That output text is streamed to text-to-speech, which also generates audio in a streaming manner.
The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.
The end-to-end response latency is around 1 second typically. It listens continuously, there are no buttons to press, and you can interrupt it while it's talking.
TTS and STT models have decent support for streaming in chunks, but the accuracy drops the smaller the chunk size. Current state of LLMs are pretty limited in their ability to handle streaming inputs due to attention window constraints. There is some emerging research into attention sinks and caching initial tokens that look promising. I don't think we're quite there yet though.
You can do the "almost-realtime" part, all locally. I tinkered with a Python script for a few hours that used Whisper to speech-to-text, fed that into a local Mistral model (don't recall which), and then piped the output into text-to-speech.
It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.
The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.
This has happened already. It was maybe about 7 months ago and I believe it was a twitter link posted here. They took it further and streamed it to twilio to create a live phone call.
This one looks similar to this but also does internet search, mail, music, smart home etc.. Hope that there is a standard interface that gets plugged into it. so anyone can develop addons.
Yes, was searching for Realtime STT and got a hit on GitHub , then looked at his other projects and found he builts up on his STT and TTS projects, it’s just 2 second latency on Local voice chat. Which is very good .
Do you happen to know anything about any open source voice identification software?
I’ve noticed with ChatGPT voice and any other voice driven assistant that a massive problem is the background voices and noise. One solution could be advanced pre-processing to ID your voice only.
Another idea I’ve had is using something professional with PTT:
Yeah github.com/KoljaB is quite a collection of stuff! I agree.
It all seems your vision of JARVIS, which I share completely but haven't accomplished what you have, again excellent work and thank you for sharing, is very attainable. Probably combining your work along with KoljaB is very promising.
Somewhat amusing to consider that the (in-character) Marvel Cinematic Universe JARVIS could have been an LLM!
And of course Ultron is an asshole, it was trained on input from Tony Stark!
Back in 2008/9 I wondered just what would be required to run JARVIS, something you could converse with naturally, would understand what you meant, and be able to take care of complex mechanical tasks. The Iron Man suits have always been mostly Do-What-I-Mean (DWIM) managed by JARVIS or other AI agents, and now all of that seems to be attainable.
It's going to be an interesting time discovering just how well a human and AI agent can work together. I could see a military personal spotter, keeping track of enemy combatants, managing larger awareness of the battlefield, etc. I wonder how much a soldier could safely offload?
Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".
Yeah I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad but kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here:
https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...
There's a French defense company, Preligens, that actually does this currently
We use this exact stack at work (OpenAI, ElevenLabs, Deepgram) for some exploratory use cases. The key issue we have now is latency with the LLM. Deepgram and Elevanlabs work brilliantly!
Problem with this is Deepgram's accuracy (but agree their speed/latency is excellent).
We used to use them too, but eventually we got so frustrated with poor accuracy we switched to Speechmatics - would definitely recommend checking them out.
We do live in-studio briefings 3x/wk. These are both in-person and live-broadcast. The first thing we did was add an AI Co-Briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. The Deepgram worked brilliantly well with transcription across the entire studio, even for un-microphoned guest participants.
That live broadcast created a lot of buzz and numerous other use cases have popped up across the company. I'm working on a tech blog showcase next week to show it off on HN hopefully!
Not sure if anyone has noticed but OpenAI now has a mobile app (I've been using the PWA all this time) and the voice assistant on there is really strong. Sounds good, fast, and seems to even run a pass on my voice before it submits the query.
I built this kind of thing for GPT-3 way back and then repurposed for 3.5 when I got API access to that. Though I used Whisper. I was hoping this would have wake word handling because that was what I struggled with but it appears that it just starts listening when you click a button or something.
I assume it could be made more responsive by using a streaming text-to-speech synthesis like ElevenLabs Cheetah. This approach was taken by the RoboDad recently discussed on HN. Btw, is there a streaming text-to-speech tool that supports languages other than english?
Here I was thinking about putting something like this in my home, and jokingly calling it Jarvis.
This will be a great starting point, shame you can't choose the models you want to talk to (ie use local models instead of OpenAI), but great nonetheless !
Actually there are a few LLM wrappers around that use the openai API spec (localai is a good one)... so you could just allow a configurable openai endpoint URI and technically users can swap in any model.
If you're really sitting at the computer for 24 hours...I have a family member that died from blood clots that formed when he sitting at his computer for too long.
IANAIPL but I find it difficult to believe that trademark is valid, considering it's never been used in trade. There is no Jarvis digital assistant software sold either fictionally or IRL. Even if the trademark were somehow upheld, I don't see how there could be any damages.
Trademarks are industry specific. If you made a fictional AI character named Jarvis and tried selling media based on that character THEN Marvel has a case. Creating a talking AI Assistant named Jarvis would be an expensive court case, which Marvel/Disney has to cash to pursue, but it would be a legal stretch with a lot of moneyed interests willing to back the non-Marvel/Disney side.
> Computer application software that may be downloaded via global computer networks and electronic communication networks for use in connection with mobile computers, mobile phones, and tablet computers, namely, software for use as a voice controlled personal digital assistant
So many that it is actually quite counterproductive to call it that way. I honestly have lost track of how many AI-based assistants named JARVIS I have encountered already =/.
The way these things usually (but not always) work is they'll send you a cease and desist letter if they intend on bothering you. Change the name at that point and you're usually good.