| HN Mirror



	by thatcherc 972 days ago
	Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other multimodal media like audio.
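
For anyone curious what "just a linear projection" amounts to, here is a minimal sketch in plain Python. It is not the model's actual code: the helper names `patchify` and `linear_project`, the image size, the patch size, and the token width are all hypothetical, and the weights are random rather than learned. The idea is only that each patch is flattened into a vector and multiplied by a single weight matrix to produce one token vector.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append this pixel's channel values
            patches.append(flat)
    return patches

def linear_project(patches, weight, bias):
    """Map each flattened patch x to a token vector x @ W + b.

    weight has shape (patch_dim, d_model); bias has length d_model.
    """
    return [
        [sum(x * w_row[j] for x, w_row in zip(p, weight)) + bias[j]
         for j in range(len(bias))]
        for p in patches
    ]

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
H, W, C, P, D = 4, 6, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
bias = [0.0] * D

tokens = linear_project(patchify(image, P), weight, bias)
print(len(tokens), len(tokens[0]))  # 6 patches, each projected to 8 dimensions
```

The same shape of computation would apply to audio if fixed-length frames were flattened in place of image patches, though whether that works well in practice is exactly the open question raised above.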

1 comments

rafaelero 972 days ago

Not using embeddings or a lookup table means they can't generate images or audio, which to me is a severe limitation. Why bother going through the process of building a multimodal transformer if it can generate nothing but text?


leodriesch 972 days ago

For an AI agent that should navigate a computer (which is Adept's use case, IIRC), it should work, as it only has to output commands.


Philpax 972 days ago

Many applications only need input, not output.
