by randomNumber7
16 days ago
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).