They say Hertz is the first of its kind, but Moshi is another duplex audio model from earlier this year that seems to perform similarly (and it runs on a MacBook):
https://github.com/kyutai-labs/moshi
Moshi never released the base model, only two conversationally finetuned models. They also never released training code except for the codec. Though I don't see any training code for Hertz either, just 3 inference notebooks, and model code full of no_grad. No paper either to help me understand how this was trained and what the architecture is like. So I'm not too sure about researcher-friendliness unless I'm missing something.
I'm not part of their team, but I lived with them for a couple of months. They've been working on it for ~5 months, and the team is made up of 16-to-20-year-olds who are too smart for university.
- LLaMA-Omni, a speech-language model built on Llama-3.1-8B-Instruct for simultaneous generation of text and speech: https://github.com/ictnlp/LLaMA-Omni
- Ichigo, an open research project extending a text-based LLM to have native listening ability, using an early fusion technique: https://github.com/homebrewltd/ichigo
Moshi is a good model to build chat applications on; this is designed to be more of a proper base model, with all the quirkiness, naturalness, and researcher-friendliness of base modeling.