Jarvis: A Voice Virtual Assistant in Python (OpenAI, ElevenLabs, Deepgram)

Y	Hacker News new \| ask \| show \| jobs

	Jarvis: A Voice Virtual Assistant in Python (OpenAI, ElevenLabs, Deepgram) (github.com)
	83 points by Alyx1337 910 days ago

14 comments

vessenes 910 days ago

I'm curious - what made you choose deepgram over just running whisper? I don't have any experience with deepgram but whisper has worked so well in my own tests that I didn't even ever consider there might be API speech recognition-only companies.

link

ty00001 910 days ago

There's three fantastic niche players in the speech-to-text market right now that you should check out: - Deepgram (cheap and dirty, but accuracy quite poor) - Speechmatics (a bit more pricey, but fantastic accuracy) - Assembly AI (just announced Series C funding of $50m)

link

iAkashPaul 910 days ago

Exactly, I don't think this project uses VAD for pausing LLM generation or interruptions in general which is key to good assistant interactions

link

Alyx1337 910 days ago

Deepgram advertised itself as being the fastest, and I wanted to focus on limiting response delay so I chose it. I hope I did not get misled.

link

java_beyb 910 days ago

well, deepgram might be the fastest among cloud-dependent APIs, like Speechmatics and Assembly AI mentioned above. -but- it cannot be faster than local or smaller models as you mentioned.

Among local solutions, Whisper SDK doesn't support streaming, I haven't seen any good workarounds or successfully implemented it. VOSK, DeepSpeech, Kaldi, et al were good once upon a time... Picovoice seems to be doing well.

I was planning to work on this: https://picovoice.ai/blog/chatgpt-ai-virtual-assistant-in-py... using Eleven Labs and Cheetah. Hope I can crave some time

link

jkachmar 910 days ago

unless i'm misunderstanding `whisper.cpp` seems to support streaming & the repository includes a native example[0] and a WASM example[1] with a demo site[2].

[0]: https://github.com/ggerganov/whisper.cpp/tree/master/example...

[1]: https://github.com/ggerganov/whisper.cpp/blob/master/example...

[2]: https://whisper.ggerganov.com/stream/

link

java_beyb 909 days ago

have you tried it? i mean for fun, it wouldn't hurt for sure and ggerganov is doing amazing stuff. kudos to him.

but whisper is designed to process audio files in 30-second batches if I'm not mistaken. it's been a while since whisper released, lol. These workarounds make the window smaller but it doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model. You can't write or train it from scratch. check out the issues referring to the real-time transcription in the repo.

can you use it? yes would it perform better than Deepgram? -although it's an API and probably not the best API- I am not sure. would i use it in my money-generating application? absolutely not.

link

cloudking 910 days ago

Wonderful hack, the overall response latency is the only thing that hurts the UX, if you can get the response time down would be epic. Nice work.

link

Alyx1337 910 days ago

Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, streaming data instead of doing the tasks sequentially

link

Spiwux 910 days ago

I wonder if we're at a point where you could build a voice assistant like that, except almost-realtime and streamed end to end:

User speaks and speech to text starts streaming text while the user is still speaking. That text stream is piped into a LLM, which also streams its output text. That output text is streamed to text-to-speech, which also generates audio in a streaming manner.

link

modeless 910 days ago

I implemented this! All local models. And I packaged it up so people can install it with one click: https://apps.microsoft.com/detail/9NC624PBFGB7

The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.

link

bjelkeman-again 910 days ago

Could you record a demo of this?

link

modeless 910 days ago

I really should! I'm not the type to publish videos of myself usually, but it really does need a video demo.

link

3abiton 910 days ago

But how realtime is it?

link

modeless 910 days ago

The end-to-end response latency is around 1 second typically. It listens continuously, there are no buttons to press, and you can interrupt it while it's talking.

link

evilantnie 910 days ago

TTS and STT models have decent support for streaming in chunks, but the accuracy drops the smaller the chunk size. Current state of LLMs are pretty limited in their ability to handle streaming inputs due to attention window constraints. There is some emerging research into attention sinks and caching initial tokens that look promising. I don't think we're quite there yet though.

link

everforward 910 days ago

You can do the "almost-realtime" part, all locally. I tinkered with a Python script for a few hours that used Whisper to speech-to-text, fed that into a local Mistral model (don't recall which), and then piped the output into text-to-speech.

It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.

The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.

link

canadiantim 910 days ago

you try using ESP box?

link

zaptrem 910 days ago

Available as a phone line API (https://www.vocode.dev) and OS project (https://github.com/vocodedev/vocode-python)

link

adroitboss 910 days ago

This has happened already. It was maybe about 7 months ago and I believe it was a twitter link posted here. They took it further and streamed it to twilio to create a live phone call.

link

fudged71 910 days ago

The one I tried was called Vocode

link

WiSaGaN 910 days ago

Any existing stream api for llm input?

link

Jayakumark 910 days ago

Found Similar project but fully local for anyone interested.

https://github.com/KoljaB/LocalAIVoiceChat

This one looks similar to this but also does internet search, mail, music, smart home etc.. Hope that there is a standard interface that gets plugged into it. so anyone can develop addons.

https://github.com/KoljaB/Linguflex

link

Alyx1337 910 days ago

How did you find these? I was literally looking for tutorials all day long and could not find something. These projects look insane!

link

Jayakumark 910 days ago

Yes, was searching for Realtime STT and got a hit on GitHub , then looked at his other projects and found he builts up on his STT and TTS projects, it’s just 2 second latency on Local voice chat. Which is very good .

link

Alyx1337 910 days ago

Here is a video demo of the project: https://youtu.be/aIg4-eL9ATc?si=66ynl4Mlci9v76rU

link

alchemist1e9 910 days ago

Nice work! Very impressed.

Do you happen to know anything about any open source voice identification software?

I’ve noticed with ChatGPT voice and any other voice driven assistant that a massive problem is the background voices and noise. One solution could be advanced pre-processing to ID your voice only.

Another idea I’ve had is using something professional with PTT:

https://sheepdogmics.com/products/quick-disconnect-mic-tubel...

link

Jayakumark 910 days ago

Check whether this can help https://github.com/resemble-ai/resemble-enhance/tree/main

link

visarga 910 days ago

Google Gemini was trained on audio and can generate audio directly. Whatever you build now will be replaced by a much better version soon.

link

Alyx1337 910 days ago

Thanks! I don't know a lot about this but someone shared this local voice assistant in the comments: https://github.com/KoljaB/LocalAIVoiceChat Could be a good lead

link

alchemist1e9 910 days ago

Yeah github.com/KoljaB is quite a collection of stuff! I agree.

It all seems your vision of JARVIS, which I share completely but haven't accomplished what you have, again excellent work and thank you for sharing, is very attainable. Probably combining your work along with KoljaB is very promising.

link

Alyx1337 910 days ago

Thank you very much!

link

bloopernova 910 days ago

Somewhat amusing to consider that the (in-character) Marvel Cinematic Universe JARVIS could have been an LLM!

And of course Ultron is an asshole, it was trained on input from Tony Stark!

Back in 2008/9 I wondered just what would be required to run JARVIS, something you could converse with naturally, would understand what you meant, and be able to take care of complex mechanical tasks. The Iron Man suits have always been mostly Do-What-I-Mean (DWIM) managed by JARVIS or other AI agents, and now all of that seems to be attainable.

It's going to be an interesting time discovering just how well a human and AI agent can work together. I could see a military personal spotter, keeping track of enemy combatants, managing larger awareness of the battlefield, etc. I wonder how much a soldier could safely offload?

link

Alyx1337 910 days ago

Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".

Yeah I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad but kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here: https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...

There's a French defense company, Preligens, that actually does this currently

link

bloopernova 910 days ago

I imagine that within the next couple of years there's going to be a "general purpose vision" model (GPV? :)

More of a framework to perform the general purpose task of "recognize things in 30 (60? 120?) frames per second video and act on events in the video"

link

TuringNYC 910 days ago

We use this exact stack at work (OpenAI, ElevenLabs, Deepgram) for some exploratory use cases. The key issue we have now is latency with the LLM. Deepgram and Elevanlabs work brilliantly!

link

ty00001 910 days ago

Problem with this is Deepgram's accuracy (but agree their speed/latency is excellent). We used to use them too, but eventually we got so frustrated with poor accuracy we switched to Speechmatics - would definitely recommend checking them out.

link

Alyx1337 910 days ago

Great! What do you guys have in mind in terms of products using these tools. Yeah unfortunately it's hard to shave on latency.

link

TuringNYC 910 days ago

We do live in-studio briefings 3x/wk. These are both in-person and live-broadcast. The first thing we did was add an AI Co-Briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. The Deepgram worked brilliantly well with transcription across the entire studio, even for un-microphoned guest participants.

That live broadcast created a lot of buzz and numerous other use cases have popped up across the company. I'm working on a tech blog showcase next week to show it off on HN hopefully!

link

khaki54 910 days ago

There is another one (Also Jarvis) that's been around for a while and is more useful, wonder if they can combine forces? https://github.com/ggeop/Python-ai-assistant

Not sure if anyone has noticed but OpenAI now has a mobile app (I've been using the PWA all this time) and the voice assistant on there is really strong. Sounds good, fast, and seems to even run a pass on my voice before it submits the query.

link

chankstein38 910 days ago

I built this kind of thing for GPT-3 way back and then repurposed for 3.5 when I got API access to that. Though I used Whisper. I was hoping this would have wake word handling because that was what I struggled with but it appears that it just starts listening when you click a button or something.

link

Alyx1337 910 days ago

Yeah I had the same issue so I used (stole) this answer on StackOverflow: https://stackoverflow.com/questions/46734345/python-record-o... Basically there's a library that records until it detects a silence

link

mfld 910 days ago

I assume it could be made more responsive by using a streaming text-to-speech synthesis like ElevenLabs Cheetah. This approach was taken by the RoboDad recently discussed on HN. Btw, is there a streaming text-to-speech tool that supports languages other than english?

link

Jean-Papoulos 910 days ago

Here I was thinking about putting something like this in my home, and jokingly calling it Jarvis. This will be a great starting point, shame you can't choose the models you want to talk to (ie use local models instead of OpenAI), but great nonetheless !

link

Alyx1337 910 days ago

That was exactly my thought haha, I want Jarvis at home. You could easily modify my code to run a local LLM instead

link

ohthehugemanate 910 days ago

Actually there are a few LLM wrappers around that use the openai API spec (localai is a good one)... so you could just allow a configurable openai endpoint URI and technically users can swap in any model.

link

pogitalonx 910 days ago

Also check out Willow- https://heywillow.io

It doesn’t synthesize voice back (yet) but open source and runs all offline on ESP32-based hardware and works with HomeAssistant!

link

qainsights 910 days ago

Great tool. I also created Kel - AI assistant for terminal. Please check https://kel.qainsights.com

link

dukeofdoom 910 days ago

If you're really sitting at the computer for 24 hours...I have a family member that died from blood clots that formed when he sitting at his computer for too long.

link

bitsandbooks 910 days ago

"Jarvis" is a trademark of Marvel, so that name will definitely not work. https://trademarks.justia.com/862/94/jarvis-86294162.html

link

torstenvl 910 days ago

IANAIPL but I find it difficult to believe that trademark is valid, considering it's never been used in trade. There is no Jarvis digital assistant software sold either fictionally or IRL. Even if the trademark were somehow upheld, I don't see how there could be any damages.

link

bsenftner 910 days ago

Trademarks are industry specific. If you made a fictional AI character named Jarvis and tried selling media based on that character THEN Marvel has a case. Creating a talking AI Assistant named Jarvis would be an expensive court case, which Marvel/Disney has to cash to pursue, but it would be a legal stretch with a lot of moneyed interests willing to back the non-Marvel/Disney side.

link

djoldman 910 days ago

And just for those who felt the need to check:

> Computer application software that may be downloaded via global computer networks and electronic communication networks for use in connection with mobile computers, mobile phones, and tablet computers, namely, software for use as a voice controlled personal digital assistant

link

beardyw 910 days ago

But there are many apps called Jarvis, so I am not sure how that is supposed to work?

link

petemir 910 days ago

So many that it is actually quite counterproductive to call it that way. I honestly have lost track of how many AI-based assistants named JARVIS I have encountered already =/.

link

racl101 910 days ago

What about Jenkins? oh yeah nope.

Um what about Jeeves? oh yeah nope.

Ok we need more butler names.

What about Smithers? Or Jeffrey?

link

Alyx1337 910 days ago

Uh oh I hope I'm not in trouble

link

Mountain_Skies 910 days ago

The way these things usually (but not always) work is they'll send you a cease and desist letter if they intend on bothering you. Change the name at that point and you're usually good.

link