Hacker News new | ask | show | jobs
by thatcherc 972 days ago
Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other multimodel media like audio.
1 comments

Not using embeddings/lookup table means they can't generate image/audio, which to me it's a severe limitation. Why bother going to the process of generating a multimodal transformer if it's able to generate nothing but text?
For an AI agent that should navigate a computer (which is Adepts use case IIRC) it should work, as it only has to output commands.
Many applications only need input, not output.