| HN Mirror


	by blixt 596 days ago
	They say Hertz is the first of its kind, but Moshi is another duplex audio model from earlier this year that seems to perform similarly (and it runs on a MacBook): https://github.com/kyutai-labs/moshi

3 comments

a2128 596 days ago

Moshi never released the base model, only two conversationally finetuned models. They also never released training code except for the codec. Though I don't see any training code for Hertz either, just 3 inference notebooks, and model code full of no_grad. No paper either to help me understand how this was trained and what the architecture is like. So I'm not too sure about researcher-friendliness unless I'm missing something.

nicholas-cc 596 days ago

We're working on a HuggingFace release that will help with finetuning. We'd like to do a paper, after a larger release - we're a team of 4.

netdevnet 596 days ago

Very impressive for just 4 people. What's the team background and how long have you been working on this?

programjames 596 days ago

I'm not part of their team, but I lived with them for a couple of months. They've been working on it for ~5 months, and they're 16- to 20-year-olds who are too smart for university.

unit149 596 days ago

For a rag-tag group of transcendental audiophiles operating electronic circuitry, it ionizes and atomizes well.

underlines 596 days ago

- LLaMA-Omni https://github.com/ictnlp/LLaMA-Omni a speech-language model built on Llama-3.1-8B-Instruct for simultaneous generation of text and speech

- moshi https://github.com/kyutai-labs/moshi speech-text foundation model using Mimi, a SOTA streaming neural audio codec

- Mini-Omni https://github.com/gpt-omni/mini-omni multimodal LLM based on Qwen2 offering speech input and output

- Ichigo https://github.com/homebrewltd/ichigo open research project extending a text-based LLM to have native listening ability, using an early fusion technique

nicholas-cc 596 days ago

Moshi is a good model to build chat applications on; this is designed to be more of a proper base model, with all the quirkiness, naturalness, and researcher-friendliness of base modeling.
