by rafaelero
972 days ago
Not using embeddings/a lookup table means they can't generate images or audio, which to me is a severe limitation. Why bother going through the process of building a multimodal transformer if it can generate nothing but text?