| HN Mirror

	by drdaeman 205 days ago
	> The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.

	Maybe I'm misunderstanding, but as I read it, the LLMs weren't really important: all they did was further interpret the outputs of a front-end audio-to-text classifier model.