Hacker News
by wwwlouishinofun 595 days ago
Tesla’s approach to pure vision-based autonomous driving—temporarily setting aside lidar and other sensors—seems designed to make this technology more accessible and scalable. By focusing on a vision-only model, they can accelerate adoption and gather large datasets for quicker iterations. Once the vision-based system reaches a mature stage, I imagine Tesla might reintegrate additional sensor data, like lidar or radar, to refine their autonomous driving suite, making it even more robust and closer to perfection.

Additionally, I’ve been exploring an idea about voice interaction systems. Currently, most voice interactions are processed by converting voice input into text, generating a text-based response, and then turning this text back into audio. But what if we could train the system to respond directly in voice, without involving text at all? If developed to maturity, this model could produce responses that feel more natural and spontaneous, possibly diverging from traditional text-to-speech outputs. Natural speech has unique syntax and rhythm, not to mention dialect and tone variations, which could make a purely voice-trained system fascinating and more human-like.
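
To make the contrast concrete, here is a minimal toy sketch of the two pipelines described above. Everything in it is an assumption for illustration: `Frame`, `speech_to_text`, `text_reply`, `text_to_speech`, and `voice_to_voice` are hypothetical stand-ins, not any real speech library or any product's actual design. It only shows that routing through text drops pitch information, which the synthesis step then has to invent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    """Toy stand-in for a slice of audio: a word plus a pitch value in Hz."""
    word: str
    pitch_hz: float


Audio = List[Frame]


def speech_to_text(audio: Audio) -> str:
    # Transcription keeps the words and discards pitch, rhythm, and tone.
    return " ".join(frame.word for frame in audio)


def text_reply(text: str) -> str:
    # Placeholder for a text-only language model: it simply echoes the input.
    return text


def text_to_speech(text: str, default_pitch_hz: float = 120.0) -> Audio:
    # Synthesis has to invent prosody, because the text carries none.
    return [Frame(word, default_pitch_hz) for word in text.split()]


def cascaded(audio: Audio) -> Audio:
    # Speech -> text -> text -> speech.
    return text_to_speech(text_reply(speech_to_text(audio)))


def voice_to_voice(audio: Audio) -> Audio:
    # Placeholder for a direct audio model: it echoes the input frames,
    # so the prosody it was given is still available in the output.
    return list(audio)


question = [Frame("really", 220.0), Frame("now", 310.0)]
print([f.pitch_hz for f in cascaded(question)])        # [120.0, 120.0]
print([f.pitch_hz for f in voice_to_voice(question)])  # [220.0, 310.0]
```

The rising pitch of the toy question survives only on the direct path; on the cascaded path it is replaced by the synthesizer's default.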

Could you let me know if your current voice interaction model follows the standard speech-to-text-to-speech process, or if there is exploration in voice-to-voice processing?

2 comments

I'm one of the devs. Our model is fully voice-to-voice; no text was involved in the making of hertz-dev, for exactly this reason.
So essentially this is voice input to voice output? Can you change gender/age/accent? Does it track prosodic information? I've been waiting for something like this.
Hertz-dev is a base model, meaning it's just trained to predict the next token of audio. If your prompt is an old male voice with a British accent, the model will most likely continue speaking in an old male voice with a British accent. Being a base model, hertz-dev is easily finetunable for specific tasks - it would be a simple change to add manual configurations for the gender/age/accent.
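As a rough illustration of why a next-token model tends to continue in the style of its prompt, here is a toy sketch. It is an assumption-heavy stand-in: integer tokens play the role of audio codes, a bigram counter plays the role of the network, and `train_bigram` and `continue_prompt` are hypothetical names. None of it reflects hertz-dev's actual architecture or tokenizer.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(sequences: List[List[int]]) -> Dict[int, Counter]:
    """Count, for every token, which tokens followed it in the training data."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_prompt(counts: Dict[int, Counter], prompt: List[int], n_new: int) -> List[int]:
    """Greedily append the most frequent follower of the last token, n_new times."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # token never seen in training, nothing to predict
        out.append(followers.most_common(1)[0][0])
    return out


# Two toy "voices" that occupy disjoint token ranges.
low_voice = [0, 1, 2] * 4
high_voice = [10, 11, 12] * 4
model = train_bigram([low_voice, high_voice])

print(continue_prompt(model, [10, 11], 4))  # [10, 11, 12, 10, 11, 12]
print(continue_prompt(model, [0, 1], 4))    # [0, 1, 2, 0, 1, 2]
```

The model has no speaker setting; the continuation stays in whichever "voice" the prompt started in because that is what the training data makes most probable.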
I assume this mirroring is due to symmetry being more typical than not among the training data, and if instead trained with contrived diversity (e.g., males only conversing with females) then the output of the base model would follow suit without pulling any levers?

It's interesting to think about what complete diversity (i.e., no tendencies toward homogeneous conversation partners whatsoever among training data) would yield, given that it's trying to deliver whatever is most probable.

I'm interested to hear more detail about approaches to adding manual controls for speaker characteristics or emotion or other things you might want to vary. What techniques do you have in mind?
I’ll jump in here - as a former New Englander, the cheerful, helpful tone of all modern voice LLMs infuriates me. And the slow speed. And the over-explanations. ChatGPT Advanced can be induced to talk more quickly, less sycophantically, and, if I like, in a not-bad regional accent; essentially I want it to mirror my tone better. But those inducements don’t stick between sessions.

On the technical side, having some sort of continuation or summarization loop seems interesting to me as a product feature. It’s not enough to build a company on, though. But it would be nice.

Oh, you have completed the project I had planned. At this point, do you think the main difficulty in improving the model lies in voice data, computing power, or algorithm optimization? Personally, I think that if you want the best possible result, you don’t need to remove the background sound from the original audio. Generating new audio from audio mixed with background sound may result in background music.

If you use completely unprocessed speech data (including speech with background music from YouTube), I think the potential will be higher, but the computing requirements will be too high. If you don’t have the money for GPUs, just apply noise reduction to the speech first.

Have you considered that this could be useful for end-to-end translation of calls in Asterisk?
I think you're describing ChatGPT Advanced Voice Mode (or Realtime API) in your second paragraph.
They were so busy inventing they forgot to do a basic Google search to see if it had already been done.