by nicholas-cc 593 days ago
I'm one of the devs. Our model is fully voice-to-voice; no text was involved in the making of hertz-dev, for exactly this reason.
3 comments

So essentially this is voice input to voice output? Can you change gender/age/accent? Does it track prosodic information? I've been waiting for something like this.
Hertz-dev is a base model, meaning it's just trained to predict the next token of audio. If your prompt is an old male voice with a British accent, the model will most likely continue speaking in an old male voice with a British accent. Being a base model, hertz-dev is easily finetunable for specific tasks - it would be a simple change to add manual configurations for the gender/age/accent.
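To make the "predict the next token, so the continuation follows the prompt" point concrete, here is a toy sketch. It is not hertz-dev's code or API: the bigram counter, the integer "audio tokens", and the names `train_bigram` and `continue_prompt` are all hypothetical stand-ins, assuming only that a base model continues a prompt with whatever was most likely in its training data.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def continue_prompt(counts, prompt, n_new):
    """Greedy autoregressive continuation: append the most likely next token."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # token never seen with a successor during training
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical "audio tokens": 0-3 stand in for one voice, 10-13 for another.
voice_a = [0, 1, 2, 3] * 5
voice_b = [10, 11, 12, 13] * 5
model = train_bigram([voice_a, voice_b])

# The continuation stays in the "voice" of whatever prompt it is given.
continuation_a = continue_prompt(model, [0, 1], 4)
continuation_b = continue_prompt(model, [10, 11], 4)
```

Nothing in the toy model knows about "voices"; it stays in the prompt's token range only because that is what followed those tokens in training, which is the same reason a real base model would tend to keep the prompt's accent or age.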
I assume this mirroring is due to symmetry being more typical than not among the training data, and if instead trained with contrived diversity (e.g., males only conversing with females) then the output of the base model would follow suit without pulling any levers?

It's interesting to think about what complete diversity (i.e., no tendencies toward homogeneous conversation partners whatsoever among training data) would yield, given that it's trying to deliver whatever is most probable.

I'm interested to hear more detail about approaches to adding manual controls for speaker characteristics or emotion or other things you might want to vary. What techniques do you have in mind?
I’ll jump in here - as a former New Englander, the cheerful, helpful tone of all modern voice LLMs infuriates me. And the slow speed. And the over-explanations. ChatGPT Advanced Voice can be induced to talk more quickly, less sycophantically and, if I like, in a not-bad regional accent; essentially I want it to mirror my tone better. But those inducements don’t stick between sessions.

On the technical side, having some sort of continuation or summarization loop seems interesting to me as a product feature. It’s not enough to build a company on, though. But it would be nice.

Oh, you have completed the project I had planned. Currently, do you think the difficulty in improving the model lies in voice data, computing power, or algorithm optimization? I personally think that if you want the best possible result, you don’t need to remove the background sound from the original audio. Outputting audio mixed with background sound as new audio may result in background music.

If you use completely unprocessed speech data (including speech with background music from YouTube), I think the potential will be higher, but the computing power required is very high. If you don’t have money to buy a GPU, just apply voice noise reduction first.

Have you considered that this could be useful for end-to-end translation of calls in Asterisk?