by alyxya
36 days ago
The noteworthy things to me are that the architecture is a transformer that takes in text, image, and audio input and produces text and audio output, all trained together, and it works in near real-time through interleaving inputs and outputs rather than pure generation of the output from a given prompt.

> Time-Aligned Micro-Turns. The interaction model works with micro-turns continuously interleaving the processing of 200ms worth of input and generation of 200ms worth of output. Rather than consuming a complete user-turn and generating a complete response, both input and output tokens are treated as streams. Working with 200ms chunks of these streams enables near real-time concurrency of multiple input and output modalities.

That's probably the main thing that distinguishes it from the multimodal models from other frontier labs as far as I can tell.
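
To make the micro-turn idea concrete, here is a rough toy sketch of how I read the quoted passage: slice the input into 200ms chunks and alternate between consuming one chunk and emitting one chunk. Everything beyond the 200ms figure is my own assumption for illustration: the names `chunk_stream`, `run_micro_turns`, and `echo_step` are hypothetical, the 16 kHz sample rate is arbitrary, and the "model" is a stand-in that just echoes its input. It says nothing about how the real system is implemented.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

CHUNK_MS = 200  # micro-turn length taken from the quoted passage


def chunk_stream(samples: List[int], sample_rate_hz: int,
                 chunk_ms: int = CHUNK_MS) -> Iterator[List[int]]:
    """Split a sample buffer into consecutive chunk_ms-sized pieces."""
    step = sample_rate_hz * chunk_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def run_micro_turns(
    input_chunks: Iterable[List[int]],
    step_fn: Callable[[Any, List[int]], Tuple[Any, List[int]]],
) -> List[Tuple[str, int]]:
    """Alternate between consuming one input chunk and emitting one output chunk.

    Returns a timeline of (event label, chunk length) pairs so the
    interleaving order is easy to inspect.
    """
    state = None  # carried across micro-turns, like a running context
    timeline: List[Tuple[str, int]] = []
    for i, chunk in enumerate(input_chunks):
        timeline.append((f"in[{i}]", len(chunk)))
        state, out = step_fn(state, chunk)
        timeline.append((f"out[{i}]", len(out)))
    return timeline


def echo_step(state: Any, chunk: List[int]) -> Tuple[int, List[int]]:
    """Hypothetical stand-in for the model: count samples seen, echo the chunk."""
    seen = (state or 0) + len(chunk)
    return seen, list(chunk)


if __name__ == "__main__":
    one_second = list(range(16000))  # 1 s of fake audio at an assumed 16 kHz
    for event in run_micro_turns(chunk_stream(one_second, 16000), echo_step):
        print(event)
```

The point of the toy is just the ordering: output for chunk i is produced before input chunk i+1 has been read, as opposed to reading the whole user turn first and then generating the whole response.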
|
We can do these things today, but they're "bolted on" as afterthoughts. Yet they work remarkably well. I wonder how well they'd work if trained in this combined regime, from the ground up.