Hacker News new | ask | show | jobs
by nicholas-cc 597 days ago
Hertz-dev is a base model, meaning it's just trained to predict the next token of audio. If your prompt is an old male voice with a British accent, the model will most likely continue speaking in an old male voice with a British accent. Being a base model, hertz-dev is easily finetunable for specific tasks - it would be a simple change to add manual configurations for the gender/age/accent.
2 comments

I assume this mirroring is due to symmetry being more typical than not among the training data, and if instead trained with contrived diversity (e.g., males only conversing with females) then the output of the base model would follow suit without pulling any levers?

It's interesting to think about what complete diversity (i.e., no tendencies toward homogeneous conversation partners whatsoever among training data) would yield, given that it's trying to deliver whatever is most probable.

I'm interested to hear more detail about approaches to adding manual controls for speaker characteristics or emotion or other things you might want to vary. What techniques do you have in mind?
I’ll jump in here - as a former new englander, the cheerful helping tone of all modern voice llms infuriates me. And the slow speed. And the over explanations. ChatGPT advanced can be induced to talk more quickly, less sycophantically and if I like in a not-bad regional accent; essentially I want it to mirror my tone better. But those inducements don’t stick between sessions.

On the technical side having some sort of continuation or summarization loop on seems interesting to me as a product feature. It’s not enough to build a company off of though. But it would be nice.