| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hexaga 463 days ago

Orpheus is a llama model trained to understand/emit audio tokens (from snac). Those tokens are just added to its tokenizer as extra tokens.

Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass through snac decoder, and you now have a frame of audio from a 'text' pipeline.

The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.

1 comments

thot_experiment 463 days ago

got it, so inference in llama.cpp server won't actually get me any audio directly

link

Metricon 463 days ago

If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can feed the streaming audio directly using `sounddevice` and `OutputStream`

On a Nvidia 4090, it's producing:

  prompt eval time =      17.93 ms /    24 tokens (    0.75 ms per token,  1338.39 tokens per second)

         eval time =    2382.95 ms /   421 tokens (    5.66 ms per token,   176.67 tokens per second)

        total time =    2400.89 ms /   445 tokens

*A Correction to the llama.cpp server command above, there are 29 layers so it should read "-ngl 29" to load all the layers to the GPU.

link

thot_experiment 463 days ago

is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though, I didn't realize lmstudio was just llama.cpp under the hood. I have it running now, though decoding is happening on CPU torch because of venv issues, still running about realtime though, I'm interested in making a full fat gguf to see what sort of degradation the quant introduces. Sounds great though, can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with whisper, and then feed that in as a prompt? What a fascinating architecture.

link