| HN Mirror



	by TheEzEzz 1001 days ago
	Completely agree, latency is key for unlocking great voice experiences. Here's a quick demo I'm working on for voice ordering: https://youtu.be/WfvLIEHwiyo Total end-to-end latency is a few hundred milliseconds: starting from speech to text, to the LLM, then to a POS to validate the SKU (no hallucinations are possible!), and finally back to generated speech. The latency is starting to feel really natural. Building out a general system to achieve this low latency will, I think, end up being a big unlock for enabling diverse applications.

10 comments

TheEzEzz 1001 days ago

Since this is getting a bit of interest, here's one more demo of this https://youtu.be/cvKUa5JpRp4 This demo shows even lower latency, plus the ability to handle very large menus with lots of complicated sub-options (this restaurant has over a billion option combinations to order a coffee). The latency is negative in some places, meaning the system finishes predicting before I finish speaking.
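One way "negative latency" can arise, sketched under loud assumptions: if the ASR emits partial transcripts while the speaker is still talking, the system can commit to a prediction as soon as only one menu item is still consistent with what has been heard. The comment does not say how the demo does this; the menu, the prefix-matching rule, and the function names below are all hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical menu, for illustration only.
MENU = ["large iced latte", "large iced mocha", "small hot tea"]


def unique_completion(partial: str, menu: Sequence[str]) -> Optional[str]:
    """Return the only menu item that starts with `partial`, else None."""
    prefix = partial.strip().lower()
    if not prefix:
        return None
    matches = [item for item in menu if item.lower().startswith(prefix)]
    return matches[0] if len(matches) == 1 else None


def predict_early(
    partials: Iterable[str], menu: Sequence[str]
) -> Tuple[Optional[str], int]:
    """Consume streaming partial transcripts and stop at the first
    unambiguous one. Returns (predicted item, partials consumed)."""
    consumed = 0
    for partial in partials:
        consumed += 1
        item = unique_completion(partial, menu)
        if item is not None:
            return item, consumed
    return None, consumed


# The speaker has only said "small", yet the order is already determined.
item, heard = predict_early(["small", "small hot", "small hot tea"], MENU)
```

A real system would presumably use a learned model rather than string prefixes, but the control flow (predict on partials, stop early once confident) is the same idea.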

arcticfox 1001 days ago

Holy cow. That's better than the average human drive-through attendant.

jonplackett 1001 days ago

This is cool. But I want to see how it handles you going back one and tweaking it.

armini 1001 days ago

We've built something similar that allows you to tweak/update notes & reminders: https://qwerki.com/ (private beta). Here's the video demo: https://www.youtube.com/shorts/2hpBTxjplIE We've since moved to training our own LLaMA, as it's more responsive and we have better reliability.

cyrux004 1001 days ago

This is pretty good. Do you think models running locally will be able to match cloud-based ones on performance (getting the task done successfully)? I'm assuming that in the context of a drive-through scenario it should be OK, but more complex systems might need external information.

TheEzEzz 1001 days ago

Definitely depends on the application, agreed. The more open ended the application, the more dependent it is on larger LLMs (and other systems) that don't easily fit on edge. At the same time, progress is happening that is increasing the size of LLM that can be run on edge. I imagine we end up in a hybrid world for many applications, where local models take a first pass (and also handle speech transcription) and only small requests are made to big cloud-based models as needed.
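A minimal sketch of the hybrid routing idea, assuming the local model can report some confidence score alongside its answer. The threshold, the `Reply` type, and both stand-in models are hypothetical; nothing here reflects a specific product.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Reply:
    text: str
    source: str  # "local" or "cloud"


def hybrid_answer(
    request: str,
    local_model: Callable[[str], Tuple[str, float]],
    cloud_model: Callable[[str], str],
    min_confidence: float = 0.8,  # assumed threshold, tune per application
) -> Reply:
    """Let the local model take the first pass; escalate only when unsure."""
    text, confidence = local_model(request)
    if confidence >= min_confidence:
        return Reply(text, "local")
    return Reply(cloud_model(request), "cloud")


# Stand-in models so the sketch runs without any ML dependencies.
def toy_local(request: str) -> Tuple[str, float]:
    known = {"one latte": "Added 1 latte."}
    if request in known:
        return known[request], 0.95
    return "", 0.1


def toy_cloud(request: str) -> str:
    return f"[cloud] handled: {request}"


fast = hybrid_answer("one latte", toy_local, toy_cloud)
slow = hybrid_answer("what is gluten free?", toy_local, toy_cloud)
```

Here `fast` is served locally and `slow` falls through to the cloud stand-in, so only the open-ended request pays the network round trip.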

wordpad25 1001 days ago

Can you share the source code? What did you do to improve the latency?

TheEzEzz 1001 days ago

Lots of work around speculative decoding, optimizing across the ASR->LLM->TTS interfaces, fine-tuning smaller models while maintaining accuracy (lots of investment here), good old fashioned engineering around managing requests to the GPU, etc. We're considering commercializing this so I can't open source just yet, but if we end up not selling it I'll definitely think about opening it up.
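For readers unfamiliar with speculative decoding, here is a toy, greedy version of the general technique, not the implementation described above. A cheap draft model guesses several tokens ahead and the expensive target model checks them; matching guesses are kept and the first mismatch is replaced by the target's own token. The two "models" are hypothetical stand-in functions, and the usual bonus token after a fully accepted draft is omitted for brevity.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_decode(
    prompt: Sequence[str],
    draft_next: NextToken,
    target_next: NextToken,
    max_new_tokens: int,
    k: int = 4,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy speculative decoding; output equals the target's greedy output."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, max_new_tokens - produced)):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies each position. A real system scores all k
        #    positions in one batched forward pass, which is where the
        #    speedup comes from; here it is called per token for clarity.
        for guess in proposal:
            correct = target_next(tokens)
            tokens.append(correct)
            produced += 1
            if correct == eos:
                return tokens
            if correct != guess:
                break  # discard the rest of the draft and re-propose
    return tokens


# Stand-in models: the draft agrees with the target except on one token.
TARGET_TEXT = ["one", "large", "iced", "latte", "<eos>"]


def toy_target(tokens: Sequence[str]) -> str:
    return TARGET_TEXT[len(tokens)] if len(tokens) < len(TARGET_TEXT) else "<eos>"


def toy_draft(tokens: Sequence[str]) -> str:
    guess = toy_target(tokens)
    return "mocha" if guess == "latte" else guess


result = speculative_decode([], toy_draft, toy_target, max_new_tokens=10, k=2)
```

Even though the draft guesses "mocha" where the target wants "latte", `result` matches what the target alone would have produced; the draft only affects speed, never the output.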

7_hours_ago 1000 days ago

Can you at least share the stack that you're using in building this? What kind of business model are you considering in commercializing it?

TheEzEzz 1000 days ago

We're designing the stack to be fairly flexible. It's Python/PyTorch under the hood, with the ability to plug and play various off-the-shelf models. For ASR we support GCP/AssemblyAI/etc, as well as a customized self-hosted version of Whisper that is tailored for stream processing. For the LLM we support fine-tuned GPT-3 models, fine-tuned Google text-bison models, or locally hosted fine-tuned Llama models (and a lot of the project goes into how to do the fine-tuning to ensure accuracy and low latency). For the TTS we support ElevenLabs/GCP/etc, and they all tie into the latency-reducing approaches.
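A plug-and-play design like this is often expressed as small interfaces that each vendor adapter implements. The sketch below is an assumption about the general shape only: the `ASR`, `LLM`, and `TTS` protocols, their method names, and the fake backends are hypothetical and do not correspond to any vendor SDK.

```python
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Chains ASR -> LLM -> TTS; any backend satisfying the protocols fits."""

    def __init__(self, asr: ASR, llm: LLM, tts: TTS) -> None:
        self.asr = asr
        self.llm = llm
        self.tts = tts

    def respond(self, audio: bytes) -> bytes:
        transcript = self.asr.transcribe(audio)
        reply = self.llm.complete(transcript)
        return self.tts.synthesize(reply)


# Fake backends so the sketch runs offline; real adapters would wrap a
# hosted API or a local model behind the same three methods.
class FakeASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeLLM:
    def complete(self, prompt: str) -> str:
        return f"ORDER: {prompt}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(FakeASR(), FakeLLM(), FakeTTS())
audio_out = pipeline.respond(b"one latte")
```

Swapping a cloud ASR for a self-hosted one then means passing a different object to `VoicePipeline`, with no change to the pipeline itself.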

Breza 991 days ago

Neat! I appreciate your approach to preventing hallucinations. I've used something similar in a different context. People make a big deal about hallucinations but I've found that validation is one of the easier aspects of AI architecture.
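The validation step can be as simple as refusing anything the model proposes that is not in the system of record. A minimal sketch, assuming the LLM outputs a list of SKU strings and the POS exposes its catalog as a mapping; the SKUs and names below are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical POS catalog: SKU -> display name.
CATALOG: Dict[str, str] = {
    "SKU-LATTE-L": "Large latte",
    "SKU-TEA-S": "Small tea",
}


def validate_order(
    proposed_skus: Sequence[str], catalog: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split model-proposed SKUs into (accepted, rejected) by catalog lookup."""
    accepted = [sku for sku in proposed_skus if sku in catalog]
    rejected = [sku for sku in proposed_skus if sku not in catalog]
    return accepted, rejected


# A hallucinated SKU never reaches the order; it is surfaced for a retry.
accepted, rejected = validate_order(["SKU-LATTE-L", "SKU-UNICORN"], CATALOG)
```

Because only catalog entries can pass, a hallucinated item can at worst trigger a clarification prompt, never a bad order line.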

nelox 1001 days ago

The voice does not seem to be able to pronounce the L in “else”. What’s happening there?

TheEzEzz 1001 days ago

Good question. Off-the-shelf TTS systems tend to enunciate every phoneme, more like a radio talk show host than a regular person, which I find a bit off-putting. I've been playing around with trying to get the voice to be more colloquial/casual. But I haven't gotten it to really sound natural yet.

g0atbutt 1001 days ago

This is a very slick demo. Nice job!

TheEzEzz 1001 days ago

Thanks! It's a lot of fun building with these new models and recent AI approaches.

arktiso 1001 days ago

Wow, the latency on requests feels great!! I’m really curious: is this running entirely with Python?

TheEzEzz 1001 days ago

100% Python but with a good deal of multiprocessing, speculative decoding, etc. As we move to production we can probably shave another 100ms off by moving over to a compiled system, but Python is great for rapid iteration.

mach1ne 1001 days ago

Manna v0.7

kortex 1001 days ago

Context for the unaware: https://en.m.wikipedia.org/wiki/Manna_(novel)

swsieber 1000 days ago

That's way slick.

Can I ask what your background is, and what things you're used to working with? I don't have the chops to build what you built, but I'd love to get there.

TheEzEzz 994 days ago

My advice is always to jump in and start building! My background is math originally, so I had some of the tools in my tool box, but I'm mostly self-taught in computer science and machine learning. I read textbooks, research papers, code repos, but most importantly I build a lot of stuff. Once I'm excited about an idea I'll figure out how to become an expert to make it a reality. Over the years the skills start to compound, so it also helps that I'm an old man!

simian1983 1001 days ago

That demo is pretty slick. What happens when you go totally off book? Like, ask it to recite the numbers of pi? Or if you become abusive? Will it call the cops?

TheEzEzz 1001 days ago

It's trained to ignore everything else. That way background conversations are ignored as well (like your kids talking in the back of the car while you order).

edge17 1000 days ago

How do you train for this?

yarone 1001 days ago

Nice work, very cool!