
prompt eval time = 17.93 ms / 24 tokens ( 0.75 ms per token, 1338.39 tokens per second)
eval time = 2382.95 ms / 421 tokens ( 5.66 ms per token, 176.67 tokens per second)
total time = 2400.89 ms / 445 tokens
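
For anyone wanting to sanity-check those numbers, here is a minimal sketch that recomputes the per-token latency and throughput from the millisecond and token counts in the timing output. The regex and the `parse_timings` helper are assumptions made for this example, not part of llama.cpp, and the log layout may differ between versions. The recomputed values can differ slightly from the printed ones, presumably because the log shows rounded milliseconds.

```python
import re

# Millisecond and token counts copied from the timing output in this comment.
LOG = """\
prompt eval time = 17.93 ms / 24 tokens
eval time = 2382.95 ms / 421 tokens
total time = 2400.89 ms / 445 tokens
"""

# Hypothetical pattern for this sketch: "<stage> time = <ms> ms / <n> tokens".
LINE_RE = re.compile(
    r"^\s*(?P<stage>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens",
    re.MULTILINE,
)


def parse_timings(log: str) -> dict:
    """Return ms/token and tokens/second for each stage found in the log."""
    timings = {}
    for match in LINE_RE.finditer(log):
        ms = float(match.group("ms"))
        tokens = int(match.group("tokens"))
        timings[match.group("stage")] = {
            "ms_per_token": ms / tokens,
            "tokens_per_second": tokens / (ms / 1000.0),
        }
    return timings


timings = parse_timings(LOG)
for stage, stats in timings.items():
    print(
        f"{stage}: {stats['ms_per_token']:.2f} ms/token, "
        f"{stats['tokens_per_second']:.2f} tokens/s"
    )
```

The `eval` line is the one that matters for generation speed: 421 tokens in roughly 2.38 seconds.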

Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though, I didn't realize LM Studio was just llama.cpp under the hood. I have it running now. Decoding is happening in torch on the CPU because of venv issues, but it's still running at about realtime. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces.

Sounds great though, can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with Whisper, and then feed that in as a prompt? What a fascinating architecture.