| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mchinen 15 days ago
	The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it. > Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

1 comments

make3 15 days ago

I guarantee you there's positional information one way or another. they just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning

link

neosat 15 days ago

Agree. Audio has strongly temporal so there is almost certainly some positional encoding one way or another.

link

aesthesia 15 days ago

Audio is 1 dimensional so the usual RoPE position encoding should handle it like it does for text tokens. You only need extra position encoding for higher-dimensional stuff like images.

link

mchinen 15 days ago

Ah yeah, thinking further it's probably just using some positioning embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.

link

pseudollm 15 days ago

No there isn't - read the paper. It's just 40msec raw audio samples. Multiplied by one matrix to translate to 3800 input vector. That's it. The next 40 msec are fed in the next transformer input step. Without any positional encoding. Repeat ad infinitum

link