by Ey7NFZ3P0nzAe 539 days ago
Has there been progress towards making RWKV multimodal? Can we use projector layers to send images to RWKV?
1 comment

There has been work on Vision RWKV and audio RWKV; one example paper is here: https://arxiv.org/abs/2403.02308

It's the same principle as in open transformer models, where an adapter is used to generate the embeddings.
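
To make the adapter idea concrete, here is a rough sketch of a projector layer: a linear map takes feature vectors from a vision encoder into the language model's embedding dimension, and the projected vectors are placed in front of the text token embeddings. This is not code from RWKV or from the linked paper. The names (make_projector, project, build_input_sequence), the toy dimensions, and the random weights are all made up for illustration, and a real adapter would be trained rather than randomly initialised.

```python
import random


def make_projector(vision_dim, model_dim, seed=0):
    """Return a randomly initialised linear projector as (weights, bias).

    Hypothetical helper: in practice these weights would be learned.
    """
    rng = random.Random(seed)
    scale = vision_dim ** -0.5
    weights = [
        [rng.gauss(0.0, scale) for _ in range(vision_dim)]
        for _ in range(model_dim)
    ]
    bias = [0.0] * model_dim
    return weights, bias


def project(features, projector):
    """Map each vision feature vector into the model's embedding space."""
    weights, bias = projector
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weights, bias)]
        for feat in features
    ]


def build_input_sequence(image_features, text_embeddings, projector):
    """Prepend the projected image embeddings to the text embeddings."""
    return project(image_features, projector) + text_embeddings


# Toy sizes, chosen only for the example.
VISION_DIM, MODEL_DIM = 8, 4
projector = make_projector(VISION_DIM, MODEL_DIM)

# Three fake image patch features and five fake text token embeddings.
image_features = [[0.1 * i] * VISION_DIM for i in range(3)]
text_embeddings = [[0.0] * MODEL_DIM for _ in range(5)]

sequence = build_input_sequence(image_features, text_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 8 4
```

The base model then consumes the combined sequence exactly as it would a sequence of ordinary token embeddings, which is why the same trick carries over from transformers.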

However, the core team's focus is currently on scaling the core text model, since that is the key performance driver, before adapting it for multimodal use.

The tech is there; the base model just needs to be better.