by Metricon 449 days ago
GGUF version created by "isaiahbjork" which is compatible with LM Studio and llama.cpp server at: https://github.com/isaiahbjork/orpheus-tts-local/

To run llama.cpp server: llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
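Once the server from the command above is listening on port 1234, any HTTP client can request a completion. A minimal sketch, assuming llama-server's OpenAI-compatible `/v1/completions` endpoint. The helper names and the sampling values are illustrative assumptions, not something the post specifies, and the prompt format Orpheus expects is defined in the linked repository rather than here:

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:1234", max_tokens=1200):
    """Build a POST request for llama-server's OpenAI-compatible completions endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,  # illustrative value, not from the post
        "temperature": 0.6,        # illustrative value, not from the post
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_completion_text(request, timeout=120):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

`fetch_completion_text(build_completion_request(prompt))` returns plain text. For this model that text is a run of `<custom_token_N>` strings, not audio, so a separate decoding step is still needed.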

3 comments

I've been testing this out; it's quite good and especially fast. Crazy that this is working so well at Q4.
Can somebody please create a Gradio client for this as well? I really want to try this out, but the complexity messes me up.
Wait, how do you get audio out of llama-server?
Orpheus is a Llama model trained to understand and emit audio tokens (from SNAC). Those tokens are just added to its tokenizer as extra tokens.

Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
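The parsing step described above can be sketched as follows. This only extracts the numeric ids from the generated text and groups them into frames of 7. The function names are illustrative, and the mapping from raw ids to SNAC codes plus the SNAC decode itself are deliberately left out (the repository's `decoder.py` covers those):

```python
import re

# Matches the text form of an audio token, e.g. <custom_token_28631>
TOKEN_PATTERN = re.compile(r"<custom_token_(\d+)>")
TOKENS_PER_FRAME = 7  # per the comment above: 7 tokens make up 1 frame


def parse_token_ids(text):
    """Return the integer ids of every <custom_token_N> found in the text."""
    return [int(number) for number in TOKEN_PATTERN.findall(text)]


def group_into_frames(token_ids, frame_size=TOKENS_PER_FRAME):
    """Split ids into complete frames, dropping any trailing partial frame."""
    usable = len(token_ids) - len(token_ids) % frame_size
    return [token_ids[i:i + frame_size] for i in range(0, usable, frame_size)]


if __name__ == "__main__":
    sample = "".join(f"<custom_token_{n}>" for n in range(100, 116))
    for frame in group_into_frames(parse_token_ids(sample)):
        print(frame)
```

With 16 tokens in the sample text, this yields two complete frames and discards the trailing two ids. In a streaming setup those leftovers would be held back until the rest of the frame arrives.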

The neat thing about this design is you can throw the model into any existing text-to-text pipeline and it just works.

Got it, so inference in llama.cpp server won't actually get me any audio directly.
If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can play the streaming audio directly using `sounddevice` and its `OutputStream`.
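Both output paths can be sketched roughly like this. It assumes the decoder hands back chunks of 16-bit mono PCM bytes at 24 kHz, which is an assumption to verify against the repository. `write_wav` and `play_stream` are hypothetical helper names, and `play_stream` needs `numpy`, `sounddevice`, and a working audio device:

```python
import wave

SAMPLE_RATE = 24000  # assumed output rate; check the repository's decoder


def write_wav(path, pcm_chunks, sample_rate=SAMPLE_RATE):
    """Write an iterable of 16-bit mono PCM byte chunks to a .wav file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav_file.writeframes(chunk)


def play_stream(pcm_chunks, sample_rate=SAMPLE_RATE):
    """Play PCM chunks as they arrive instead of waiting for the whole file."""
    # Imported here so write_wav works without these packages installed.
    import numpy as np
    import sounddevice as sd

    with sd.OutputStream(samplerate=sample_rate, channels=1, dtype="int16") as stream:
        for chunk in pcm_chunks:
            samples = np.frombuffer(chunk, dtype=np.int16).reshape(-1, 1)
            stream.write(samples)  # blocks until the device accepts the samples
```

Because both helpers accept any iterable, a generator that yields chunks as frames are decoded works for either one. Playback can then start before generation has finished.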

On an Nvidia 4090, it's producing:

  prompt eval time =      17.93 ms /    24 tokens (    0.75 ms per token,  1338.39 tokens per second)
         eval time =    2382.95 ms /   421 tokens (    5.66 ms per token,   176.67 tokens per second)
        total time =    2400.89 ms /   445 tokens
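
As a quick sanity check on the figures above, the reported rates can be re-derived from the raw counts, and the generation rate can be restated in frames using the 7-tokens-per-frame figure mentioned earlier in the thread. Variable names are illustrative. The frame rate is only a rough estimate, since not every generated token is necessarily an audio token:

```python
# Values copied from the "eval time" line of the llama-server log above.
EVAL_MS = 2382.95
EVAL_TOKENS = 421
TOKENS_PER_FRAME = 7  # from the earlier comment: 7 tokens make up 1 frame

tokens_per_second = EVAL_TOKENS / (EVAL_MS / 1000.0)
ms_per_token = EVAL_MS / EVAL_TOKENS
frames_per_second = tokens_per_second / TOKENS_PER_FRAME

print(f"{tokens_per_second:.2f} tokens/s")
print(f"{ms_per_token:.2f} ms/token")
print(f"{frames_per_second:.2f} frames/s (estimate)")
```

The first two values agree with the log's 176.67 tokens per second and 5.66 ms per token.
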
*Correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU.
Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though; I didn't realize LM Studio was just llama.cpp under the hood. I have it running now, though decoding is happening in CPU torch because of venv issues. It's still running at about realtime. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces. Sounds great though; can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with Whisper, and then feed that in as a prompt? What a fascinating architecture.
You need to decode the tokens into audio. See the `convert_to_audio` method in `decoder.py`.

You can run `python gguf_orpheus.py --text "Hello, this is a test" --voice tara` and it will connect to the running llama-server.

See https://github.com/isaiahbjork/orpheus-tts-local

See my GitHub issue for example output: https://github.com/isaiahbjork/orpheus-tts-local/issues/15