by ekelsen 980 days ago
Image patches are projected directly into an embedding that goes into the decoder Transformer. The same thing could be done for audio.
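For anyone who wants to see what "projected directly" could look like, here is a minimal sketch in plain Python. It assumes the projection is a single linear layer applied to each flattened patch, with no separate image encoder in front. The function names, sizes, and random weights are all made up for illustration and are not taken from any particular model.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append all channels of this pixel
            patches.append(vec)
    return patches

def linear(vec, weight, bias):
    """Compute weight @ vec + bias, with weight shaped (d_model, len(vec))."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def embed_patches(image, patch, weight, bias):
    """Map every patch to one d_model-sized vector the decoder can consume as a token."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]

# Hypothetical sizes, chosen only to keep the example small.
H, W, C, PATCH, D_MODEL = 4, 6, 3, 2, 8
rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH * C)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

tokens = embed_patches(image, PATCH, weight, bias)
print(len(tokens), len(tokens[0]))  # 6 patches, each an 8-dim embedding
```

Under the same assumption, audio would only need a different `patchify`: chop the waveform (or a spectrogram) into fixed-length frames and pass each frame through the same kind of `linear` projection.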