Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other modalities like audio.
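For anyone curious what "just a linear projection" could look like, here's a rough NumPy sketch. The patch size (16), token width (64), random weights, and the helper names `patchify` and `project_patches` are all my own placeholders, not taken from the model being discussed; in a real model the weight matrix would be learned.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)

def project_patches(image, weight, bias, patch):
    """Map each flattened patch to one token vector with a single matmul."""
    return patchify(image, patch) @ weight + bias

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
patch, d_model = 16, 64
image = rng.random((64, 48, 3))  # H=64, W=48, RGB
weight = rng.normal(scale=0.02, size=(patch * patch * 3, d_model))
bias = np.zeros(d_model)

tokens = project_patches(image, weight, bias, patch)
print(tokens.shape)  # (12, 64): a 4 x 3 grid of patches, one token each
```

So the whole "image encoder" here is one matrix multiply per patch, with no separate vision model in front of the transformer.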
Not using embeddings or a lookup table means they can't generate images or audio, which to me is a severe limitation. Why bother going through the process of building a multimodal transformer if it can generate nothing but text?