Hacker News new | ask | show | jobs
Jarvis: A Voice Virtual Assistant in Python (OpenAI, ElevenLabs, Deepgram) (github.com)
83 points by Alyx1337 910 days ago
14 comments

I'm curious - what made you choose deepgram over just running whisper? I don't have any experience with deepgram but whisper has worked so well in my own tests that I didn't even ever consider there might be API speech recognition-only companies.
There's three fantastic niche players in the speech-to-text market right now that you should check out: - Deepgram (cheap and dirty, but accuracy quite poor) - Speechmatics (a bit more pricey, but fantastic accuracy) - Assembly AI (just announced Series C funding of $50m)
Exactly, I don't think this project uses VAD for pausing LLM generation or interruptions in general which is key to good assistant interactions
Deepgram advertised itself as being the fastest, and I wanted to focus on limiting response delay so I chose it. I hope I did not get misled.
well, deepgram might be the fastest among cloud-dependent APIs, like Speechmatics and Assembly AI mentioned above. -but- it cannot be faster than local or smaller models as you mentioned.

Among local solutions, Whisper SDK doesn't support streaming, I haven't seen any good workarounds or successfully implemented it. VOSK, DeepSpeech, Kaldi, et al were good once upon a time... Picovoice seems to be doing well.

I was planning to work on this: https://picovoice.ai/blog/chatgpt-ai-virtual-assistant-in-py... using Eleven Labs and Cheetah. Hope I can crave some time

unless i'm misunderstanding `whisper.cpp` seems to support streaming & the repository includes a native example[0] and a WASM example[1] with a demo site[2].

[0]: https://github.com/ggerganov/whisper.cpp/tree/master/example...

[1]: https://github.com/ggerganov/whisper.cpp/blob/master/example...

[2]: https://whisper.ggerganov.com/stream/

have you tried it? i mean for fun, it wouldn't hurt for sure and ggerganov is doing amazing stuff. kudos to him.

but whisper is designed to process audio files in 30-second batches if I'm not mistaken. it's been a while since whisper released, lol. These workarounds make the window smaller but it doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model. You can't write or train it from scratch. check out the issues referring to the real-time transcription in the repo.

can you use it? yes would it perform better than Deepgram? -although it's an API and probably not the best API- I am not sure. would i use it in my money-generating application? absolutely not.

Wonderful hack, the overall response latency is the only thing that hurts the UX, if you can get the response time down would be epic. Nice work.
Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, streaming data instead of doing the tasks sequentially
I wonder if we're at a point where you could build a voice assistant like that, except almost-realtime and streamed end to end:

User speaks and speech to text starts streaming text while the user is still speaking. That text stream is piped into a LLM, which also streams its output text. That output text is streamed to text-to-speech, which also generates audio in a streaming manner.

I implemented this! All local models. And I packaged it up so people can install it with one click: https://apps.microsoft.com/detail/9NC624PBFGB7

The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.

Could you record a demo of this?
I really should! I'm not the type to publish videos of myself usually, but it really does need a video demo.
But how realtime is it?
The end-to-end response latency is around 1 second typically. It listens continuously, there are no buttons to press, and you can interrupt it while it's talking.
TTS and STT models have decent support for streaming in chunks, but the accuracy drops the smaller the chunk size. Current state of LLMs are pretty limited in their ability to handle streaming inputs due to attention window constraints. There is some emerging research into attention sinks and caching initial tokens that look promising. I don't think we're quite there yet though.
You can do the "almost-realtime" part, all locally. I tinkered with a Python script for a few hours that used Whisper to speech-to-text, fed that into a local Mistral model (don't recall which), and then piped the output into text-to-speech.

It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.

The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.

you try using ESP box?
Available as a phone line API (https://www.vocode.dev) and OS project (https://github.com/vocodedev/vocode-python)
This has happened already. It was maybe about 7 months ago and I believe it was a twitter link posted here. They took it further and streamed it to twilio to create a live phone call.
The one I tried was called Vocode
Any existing stream api for llm input?
Found Similar project but fully local for anyone interested.

https://github.com/KoljaB/LocalAIVoiceChat

This one looks similar to this but also does internet search, mail, music, smart home etc.. Hope that there is a standard interface that gets plugged into it. so anyone can develop addons.

https://github.com/KoljaB/Linguflex

How did you find these? I was literally looking for tutorials all day long and could not find something. These projects look insane!
Yes, was searching for Realtime STT and got a hit on GitHub , then looked at his other projects and found he builts up on his STT and TTS projects, it’s just 2 second latency on Local voice chat. Which is very good .
Here is a video demo of the project: https://youtu.be/aIg4-eL9ATc?si=66ynl4Mlci9v76rU
Nice work! Very impressed.

Do you happen to know anything about any open source voice identification software?

I’ve noticed with ChatGPT voice and any other voice driven assistant that a massive problem is the background voices and noise. One solution could be advanced pre-processing to ID your voice only.

Another idea I’ve had is using something professional with PTT:

https://sheepdogmics.com/products/quick-disconnect-mic-tubel...

Google Gemini was trained on audio and can generate audio directly. Whatever you build now will be replaced by a much better version soon.
Thanks! I don't know a lot about this but someone shared this local voice assistant in the comments: https://github.com/KoljaB/LocalAIVoiceChat Could be a good lead
Yeah github.com/KoljaB is quite a collection of stuff! I agree.

It all seems your vision of JARVIS, which I share completely but haven't accomplished what you have, again excellent work and thank you for sharing, is very attainable. Probably combining your work along with KoljaB is very promising.

Thank you very much!
Somewhat amusing to consider that the (in-character) Marvel Cinematic Universe JARVIS could have been an LLM!

And of course Ultron is an asshole, it was trained on input from Tony Stark!

Back in 2008/9 I wondered just what would be required to run JARVIS, something you could converse with naturally, would understand what you meant, and be able to take care of complex mechanical tasks. The Iron Man suits have always been mostly Do-What-I-Mean (DWIM) managed by JARVIS or other AI agents, and now all of that seems to be attainable.

It's going to be an interesting time discovering just how well a human and AI agent can work together. I could see a military personal spotter, keeping track of enemy combatants, managing larger awareness of the battlefield, etc. I wonder how much a soldier could safely offload?

Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".

Yeah I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad but kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here: https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...

There's a French defense company, Preligens, that actually does this currently

I imagine that within the next couple of years there's going to be a "general purpose vision" model (GPV? :)

More of a framework to perform the general purpose task of "recognize things in 30 (60? 120?) frames per second video and act on events in the video"

We use this exact stack at work (OpenAI, ElevenLabs, Deepgram) for some exploratory use cases. The key issue we have now is latency with the LLM. Deepgram and Elevanlabs work brilliantly!
Problem with this is Deepgram's accuracy (but agree their speed/latency is excellent). We used to use them too, but eventually we got so frustrated with poor accuracy we switched to Speechmatics - would definitely recommend checking them out.
Great! What do you guys have in mind in terms of products using these tools. Yeah unfortunately it's hard to shave on latency.
We do live in-studio briefings 3x/wk. These are both in-person and live-broadcast. The first thing we did was add an AI Co-Briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. The Deepgram worked brilliantly well with transcription across the entire studio, even for un-microphoned guest participants.

That live broadcast created a lot of buzz and numerous other use cases have popped up across the company. I'm working on a tech blog showcase next week to show it off on HN hopefully!

There is another one (Also Jarvis) that's been around for a while and is more useful, wonder if they can combine forces? https://github.com/ggeop/Python-ai-assistant

Not sure if anyone has noticed but OpenAI now has a mobile app (I've been using the PWA all this time) and the voice assistant on there is really strong. Sounds good, fast, and seems to even run a pass on my voice before it submits the query.

I built this kind of thing for GPT-3 way back and then repurposed for 3.5 when I got API access to that. Though I used Whisper. I was hoping this would have wake word handling because that was what I struggled with but it appears that it just starts listening when you click a button or something.
Yeah I had the same issue so I used (stole) this answer on StackOverflow: https://stackoverflow.com/questions/46734345/python-record-o... Basically there's a library that records until it detects a silence
I assume it could be made more responsive by using a streaming text-to-speech synthesis like ElevenLabs Cheetah. This approach was taken by the RoboDad recently discussed on HN. Btw, is there a streaming text-to-speech tool that supports languages other than english?
Here I was thinking about putting something like this in my home, and jokingly calling it Jarvis. This will be a great starting point, shame you can't choose the models you want to talk to (ie use local models instead of OpenAI), but great nonetheless !
That was exactly my thought haha, I want Jarvis at home. You could easily modify my code to run a local LLM instead
Actually there are a few LLM wrappers around that use the openai API spec (localai is a good one)... so you could just allow a configurable openai endpoint URI and technically users can swap in any model.
Also check out Willow- https://heywillow.io

It doesn’t synthesize voice back (yet) but open source and runs all offline on ESP32-based hardware and works with HomeAssistant!

Great tool. I also created Kel - AI assistant for terminal. Please check https://kel.qainsights.com
If you're really sitting at the computer for 24 hours...I have a family member that died from blood clots that formed when he sitting at his computer for too long.
"Jarvis" is a trademark of Marvel, so that name will definitely not work. https://trademarks.justia.com/862/94/jarvis-86294162.html
IANAIPL but I find it difficult to believe that trademark is valid, considering it's never been used in trade. There is no Jarvis digital assistant software sold either fictionally or IRL. Even if the trademark were somehow upheld, I don't see how there could be any damages.
Trademarks are industry specific. If you made a fictional AI character named Jarvis and tried selling media based on that character THEN Marvel has a case. Creating a talking AI Assistant named Jarvis would be an expensive court case, which Marvel/Disney has to cash to pursue, but it would be a legal stretch with a lot of moneyed interests willing to back the non-Marvel/Disney side.
And just for those who felt the need to check:

> Computer application software that may be downloaded via global computer networks and electronic communication networks for use in connection with mobile computers, mobile phones, and tablet computers, namely, software for use as a voice controlled personal digital assistant

But there are many apps called Jarvis, so I am not sure how that is supposed to work?
So many that it is actually quite counterproductive to call it that way. I honestly have lost track of how many AI-based assistants named JARVIS I have encountered already =/.
What about Jenkins? oh yeah nope.

Um what about Jeeves? oh yeah nope.

Ok we need more butler names.

What about Smithers? Or Jeffrey?

Uh oh I hope I'm not in trouble
The way these things usually (but not always) work is they'll send you a cease and desist letter if they intend on bothering you. Change the name at that point and you're usually good.