by max93
558 days ago
We did similar research earlier and improved performance to a level comparable to models with 3x larger layer sizes: https://arxiv.org/html/2409.14199v3. We spend more compute time in the latent space to get better performance. However, this approach introduces greater resistance than Chain of Thought (CoT) reasoning in token space, especially once the number of CoT rounds in the latent space exceeds 20.
I would use the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
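
To make "more computational time in the latent space" concrete, here is a rough toy sketch, not the method from the linked paper: it just reapplies one weight-shared block to a hidden state for a number of rounds without ever decoding to tokens. The names `make_block` and `latent_rounds`, the tanh residual update, and the 0.1 step size are all assumptions made up for illustration.

```python
import math
import random


def make_block(dim, seed=0):
    """Build one small residual block (hypothetical stand-in for a model layer)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def block(h):
        # Residual update h + 0.1 * tanh(W h); the same weights are reused on every call.
        update = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]
        return [x + 0.1 * u for x, u in zip(h, update)]

    return block


def latent_rounds(block, h, rounds):
    """Apply the same block `rounds` times to the hidden state, never decoding to tokens."""
    for _ in range(rounds):
        h = block(h)
    return h


block = make_block(dim=8)
h0 = [1.0] * 8
h4 = latent_rounds(block, h0, rounds=4)    # a few rounds of latent refinement
h20 = latent_rounds(block, h0, rounds=20)  # the regime mentioned above as harder
```

The point of the sketch is only that extra rounds add compute without adding parameters; it says nothing about why more than 20 rounds becomes harder in practice.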