Hacker News
by cyrux004 992 days ago
This is pretty good. Do you think models running locally will be able to match cloud-based ones in performance (getting the task done successfully)? I am assuming that for a drive-through scenario it should be OK, but more complex systems might need external information.

Definitely depends on the application, agreed. The more open-ended the application, the more dependent it is on larger LLMs (and other systems) that don't easily fit on edge. At the same time, progress is happening that is increasing the size of LLMs that can be run on edge. I imagine we end up in a hybrid world for many applications, where local models take a first pass (and also handle speech transcription) and only small requests are made to big cloud-based models as needed.
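To make the hybrid idea concrete, here is a minimal sketch of a local-first router. It is an illustration only, not the project's code: the `route` function, the confidence score returned by the local model, the 0.8 threshold, and the toy stand-in models are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces: a local model returns (text, confidence in [0, 1]);
# a cloud model returns text only.
LocalModel = Callable[[str], Tuple[str, float]]
CloudModel = Callable[[str], str]


@dataclass
class Reply:
    text: str
    source: str  # "local" or "cloud"


def route(utterance: str, local: LocalModel, cloud: CloudModel,
          threshold: float = 0.8) -> Reply:
    """Let the local model take a first pass; escalate only when it is unsure."""
    text, confidence = local(utterance)
    if confidence >= threshold:
        return Reply(text, "local")
    return Reply(cloud(utterance), "cloud")


# Toy stand-ins so the sketch runs without any real model.
def toy_local(utterance: str) -> Tuple[str, float]:
    known = {"one coffee": "Adding one coffee."}
    if utterance in known:
        return known[utterance], 0.95
    return "", 0.1  # low confidence: request is outside the narrow local domain


def toy_cloud(utterance: str) -> str:
    return f"[cloud] handling: {utterance}"


in_domain = route("one coffee", toy_local, toy_cloud)
open_ended = route("what is gluten free here?", toy_local, toy_cloud)
```

In this sketch the in-domain order stays on the device and only the open-ended question goes to the cloud. How to get a trustworthy confidence signal from a real local model is the hard part and is not addressed here.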
Can you share the source code? What did you do to improve the latency?
Lots of work around speculative decoding, optimizing across the ASR->LLM->TTS interfaces, fine-tuning smaller models while maintaining accuracy (lots of investment here), good old-fashioned engineering around managing requests to the GPU, etc. We're considering commercializing this so I can't open source just yet, but if we end up not selling it I'll definitely think about opening it up.
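For readers unfamiliar with speculative decoding, here is a toy sketch of the greedy variant. It is not the implementation discussed above, which is not public. The "models" are hypothetical next-token functions over integer tokens, and the target's verification is simulated with one call per position; a real system would check all drafted positions in a single batched forward pass of the large model, which is where the latency saving comes from.

```python
from typing import Callable, List

# Hypothetical model interface: token sequence in, greedy next token out.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one call to the model per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: a cheap draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The small draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected != tok:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every proposal matched; the verification also yields one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


baseline = greedy_decode(toy_target, [2], max_new=8)
speculated = speculative_decode(toy_target, toy_draft, [2], max_new=8, k=4)
```

With greedy verification the output is identical to decoding with the target alone; the draft only changes how many tokens are settled per verification round.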
Can you at least share the stack that you're using in building this? What kind of business model are you considering in commercializing it?
We're designing the stack to be fairly flexible. It's Python/PyTorch under the hood, with the ability to plug and play various off-the-shelf models. For ASR we support GCP/AssemblyAI/etc., as well as a customized self-hosted version of Whisper that is tailored for stream processing. For the LLM we support fine-tuned GPT-3 models, fine-tuned Google text-bison models, or locally hosted fine-tuned Llama models (and a lot of the project goes into how to do the fine-tuning to ensure accuracy and low latency). For the TTS we support ElevenLabs/GCP/etc., and they all tie into the latency-reducing approaches.
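A plug-and-play stack like this could be structured roughly as below. This is a guess at the general shape, not the actual design: the `ASR`, `LLM`, and `TTS` protocols, the `VoicePipeline` class, and the fake backends are all hypothetical names, and the sketch processes one complete utterance rather than a stream.

```python
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Chains ASR -> LLM -> TTS; any backend matching the protocols can be swapped in."""
    asr: ASR
    llm: LLM
    tts: TTS

    def run(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        reply = self.llm.respond(text)
        return self.tts.synthesize(reply)


# Fake backends so the sketch runs without any model or network access.
class FakeASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio is already text


class FakeLLM:
    def respond(self, text: str) -> str:
        return f"You said: {text}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretend the bytes are audio


pipeline = VoicePipeline(FakeASR(), FakeLLM(), FakeTTS())
spoken = pipeline.run(b"hello")
```

Swapping a hosted ASR for a self-hosted one would then mean passing a different object that has a `transcribe` method, with no change to the pipeline itself.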