by mettamage 11 days ago
> The way it works is that a vision encoder (similar to what ChatGPT and Claude use) takes image pixels and translates them into the LLM’s token embedding space. The model does not “see” the image the way a human does. Instead, the vision encoder compresses the image into a sequence of vectors that live in the same mathematical space as text tokens. The LLM then processes those vectors as if they were just another sequence of tokens.
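
A rough sketch of what the quoted passage describes, in plain Python. Everything here is an assumption for illustration: the 4x4 patch size, the 8-wide embedding, and the random `W_proj` matrix are hypothetical, and a single untrained linear projection stands in for a real vision encoder, which is a trained network. The only point is the shape of the idea: image patches become vectors with the same width as text token embeddings, and the two are concatenated into one sequence.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches

def project(vectors, weights):
    """Linear map: each row of `weights` produces one output dimension."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weights]
            for vec in vectors]

rng = random.Random(0)
D_MODEL = 8   # hypothetical embedding width shared by text and image vectors
PATCH = 4     # hypothetical patch size

# An 8x8 "image" of random pixel intensities.
image = [[rng.random() for _ in range(8)] for _ in range(8)]

# Untrained stand-in for the encoder: one row of weights per output dimension.
W_proj = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]

# Stand-ins for three text tokens that have already been embedded.
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

image_embeddings = project(patchify(image, PATCH), W_proj)

# The LLM would receive one sequence; image and text vectors have the same width.
sequence = image_embeddings + text_embeddings
```

With an 8x8 image and 4x4 patches, `image_embeddings` holds four vectors, so `sequence` has seven entries, each of width `D_MODEL`.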

Could you also do this for music and specifically sound synthesis? It would be awesome to vibe synthesize sounds and then see the VSTi parameters surrounding it.

1 comment

I don't think so. Cramming new senses into the latent space of the model is one thing, but having a model output tokens that can be detokenized into sound is completely different and requires a very different type of data.

What do you mean? Why not? We can already FFT sound into "words", so why not have some kind of dictionary to an arbitrary level of precision/fidelity?
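
For what it's worth, here is a minimal sketch of the FFT-into-"words" idea, assuming a toy setup: fixed 16-sample frames, a naive DFT, and a hand-built three-entry `codebook` (all hypothetical names and values, not anyone's actual tokenizer). Each frame's magnitude spectrum is mapped to the index of the nearest codebook entry, and that index plays the role of a token.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive DFT (bins 0..n/2)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def frames(signal, size):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def tokenize(signal, size, codebook):
    """Map each frame's spectrum to a token id (its nearest codebook index)."""
    return [nearest(dft_magnitudes(f), codebook) for f in frames(signal, size)]

def tone(freq_bin, size, n_frames):
    """Sine wave that completes freq_bin cycles per frame."""
    return [math.sin(2 * math.pi * freq_bin * t / size)
            for t in range(size * n_frames)]

FRAME = 16
# Hypothetical dictionary: one entry per "word", here the spectra of three pure tones.
codebook = [dft_magnitudes(tone(b, FRAME, 1)) for b in (1, 3, 5)]

signal = tone(1, FRAME, 2) + tone(5, FRAME, 1) + tone(3, FRAME, 2)
tokens = tokenize(signal, FRAME, codebook)
print(tokens)  # [0, 0, 2, 1, 1]
```

This only covers the sound-to-token direction. The magnitude spectrum discards phase and the codebook lookup is lossy, so turning tokens back into audio needs more than inverting these steps; a larger codebook buys fidelity at the cost of a bigger vocabulary.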