Hacker News new | ask | show | jobs
by max93 558 days ago
We conducted similar research earlier and successfully improved performance to a level comparable to models with 3x larger layer sizes. https://arxiv.org/html/2409.14199v3 We utilize more computational time in the latent space to achieve better performance. However, this approach introduces greater resistance compared to Chain of Thought (CoT) reasoning in the token space, especially if the number of CoT rounds in the latent space exceeds 20. I would using the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
1 comments

So perhaps could be useful for fine-tuning on smaller models?
I think so. I believe this type of reasoning method, which achieves better results through longer computation time, is very useful on edge devices like mobile phones. Consider a scenario where we only need the model to output a function/action call on the phone; we don't require it to provide an immediate response.