|
|
|
|
|
by vvolhejn
233 days ago
|
|
I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels. But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want. Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759 At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926 |
|