|
|
|
|
|
by mchinen
15 days ago
|
|
The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it. > Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens. |
|