by earthnail
95 days ago
I don’t understand the approach:

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization? I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.