Hacker News new | ask | show | jobs
by Q6T46nT668w6i3m 980 days ago
Neat idea! Are the batches encoded as tokens into the input sequence? This is something I really like about the multi-modal PALM papers since it enables the multi-modal tokens to be referenced.
1 comments

Image patches are projected directly into an embedding that goes into the decoder Transformer. The same thing could be done for audio.