by drdaeman
205 days ago
> The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.

Maybe I'm misunderstanding, but as I read it, the LLMs weren't really important: all they did was further interpret the outputs of a front-end audio-to-text classifier model.