| HN Mirror

by joshstrange 260 days ago

This is really neat. I cloned my voice and can generate speech from text, but I can't seem to generate longer clips. The README.md says:

> Context Window: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)

But it's cutting off for me before even that point. I fed it a paragraph of text and it gets part of the way through it before skipping a few words ahead, saying a few words more, then cutting off at 17 seconds. Another test just cut off after 21 seconds (no skipping).
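For reference, here is a rough sketch of the budget arithmetic behind that README line. It assumes the two quoted figures (2048 tokens, ~30 seconds) imply a constant token rate and that the voice-cloning prompt is charged against the same window; both are assumptions on my part, not something the README confirms, and `remaining_seconds` is a hypothetical helper name.

```python
# Figures quoted from the README; everything derived from them is an assumption.
CONTEXT_TOKENS = 2048
CONTEXT_SECONDS = 30.0

# Assumed constant rate: about 68 tokens per second of audio.
TOKENS_PER_SECOND = CONTEXT_TOKENS / CONTEXT_SECONDS


def remaining_seconds(prompt_seconds: float) -> float:
    """Estimate how many seconds of new audio fit after the prompt clip.

    Hypothetical helper: assumes the prompt audio consumes context at the
    same rate as generated audio, and ignores any tokens used by the text.
    """
    used_tokens = prompt_seconds * TOKENS_PER_SECOND
    free_tokens = max(0.0, CONTEXT_TOKENS - used_tokens)
    return free_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    for prompt in (5.0, 10.0, 15.0):
        print(f"{prompt:>4.0f}s prompt -> ~{remaining_seconds(prompt):.0f}s of output")
```

Under those assumptions a longer reference clip leaves proportionally less room for output, which might be part of why the cutoff lands well short of 30 seconds, but I have not verified that.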

Lastly, I'm on an MBP M3 Max with 128GB running Sequoia. I'm following all the "Guidelines for minimizing Latency", but generating a 4.16 second clip takes 16.51s for me, which is roughly 4x slower than real time. I'm not sure what I'm doing wrong, or how you would use this in practice, since it's not realtime and the limit is so low (and unclear). Maybe you are supposed to cut your text into smaller chunks and run them in parallel/sequence to get around the limit?
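A minimal sketch of the chunking idea, assuming sentence boundaries are acceptable places to split and that a character budget is a usable stand-in for the real token limit. `chunk_text` and `max_chars` are hypothetical names, and the call that actually synthesizes each chunk is left out because I don't know the project's API.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    split mid-sentence, so a chunk can exceed the budget in that case.
    """
    # Split after ., ! or ? followed by whitespace (a crude sentence splitter).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    paragraph = "This is the first sentence. Here is a second one! And a third?"
    for i, chunk in enumerate(chunk_text(paragraph, max_chars=40)):
        # Each chunk would be synthesized separately, then the clips joined in order.
        print(i, chunk)
```

Running the chunks in sequence keeps the output order trivial to preserve; running them in parallel would need the clips reassembled by index afterwards.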