Hacker News
by Ey7NFZ3P0nzAe 16 days ago
I don't think so. Cramming new senses into the latent space of the model is one thing, but having a model output tokens that can be detokenized into sound is completely different and requires a very different type of data.

What do you mean? Why not? We can already FFT sound into "words", so why not have some kind of dictionary to an arbitrary level of precision/fidelity?
Sound frequency is continuous: "between the sounds of A and O" means something, but "between the letters A and F" means nothing.
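To make the "FFT sound into words" idea concrete, here is a minimal sketch, not anyone's actual audio tokenizer. It assumes a toy scheme: cut the signal into fixed frames, FFT each frame, and round every magnitude to one of `n_levels` integer tokens. The names `tokenize`, `detokenize` and `roundtrip_error` are hypothetical, and the phase is passed through unquantized to keep the example short, which a real system could not do.

```python
import numpy as np

def tokenize(signal, frame_len, n_levels):
    """Frame the signal, FFT each frame, quantize magnitudes to integer tokens."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.fft.rfft(frames, axis=1)
    magnitude = np.abs(spectrum)
    scale = float(magnitude.max()) or 1.0  # avoid dividing by zero on silence
    tokens = np.round(magnitude / scale * (n_levels - 1)).astype(int)
    # Assumption: phase is kept as-is (unquantized) to keep the sketch small.
    return tokens, scale, np.angle(spectrum)

def detokenize(tokens, scale, phase, frame_len, n_levels):
    """Map tokens back to magnitudes and invert the FFT frame by frame."""
    magnitude = tokens / (n_levels - 1) * scale
    frames = np.fft.irfft(magnitude * np.exp(1j * phase), n=frame_len, axis=1)
    return frames.reshape(-1)

def roundtrip_error(signal, frame_len, n_levels):
    """RMS difference between the signal and its tokenized reconstruction."""
    tokens, scale, phase = tokenize(signal, frame_len, n_levels)
    rebuilt = detokenize(tokens, scale, phase, frame_len, n_levels)
    return float(np.sqrt(np.mean((signal[: len(rebuilt)] - rebuilt) ** 2)))

# One second of a two-tone test signal (440 Hz plus a quieter 1000 Hz).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

err_coarse = roundtrip_error(signal, 256, 4)    # 4-entry "dictionary"
err_fine = roundtrip_error(signal, 256, 256)    # 256-entry "dictionary"
print(err_coarse, err_fine)
```

In this toy setup the larger dictionary reconstructs the signal more closely, which is the "arbitrary level of precision" point. It also illustrates the continuity point: neighbouring token values decode to neighbouring magnitudes, whereas neighbouring text token IDs have no such relationship. It says nothing about whether a language model can learn to emit such tokens, which is the harder question raised at the top of the thread.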