Arrows of Time for Large Language Models (arxiv.org)
6 points by tianlong 865 days ago
2 comments

Isn't it obvious that, since LLMs are trained to predict the next word, they do better at that than at predicting the previous one?
The paper says that the LLMs predicting the previous token were themselves pre-trained on that backward task, so the difference is not obvious.
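To make that less hand-wavy, here is a minimal sketch of why the gap is not automatic. By the chain rule of probability, a joint distribution over sequences can be factorized left to right or right to left, and an exact model gets the same expected loss either way. The toy distribution `JOINT` and the function names are made up for illustration; this is not the paper's code or setup.

```python
import math
from collections import defaultdict

# Hypothetical toy joint distribution over two-token sequences (x1, x2).
JOINT = {
    ("the", "cat"): 0.4,
    ("the", "dog"): 0.3,
    ("a", "cat"): 0.1,
    ("a", "dog"): 0.2,
}

def marginal(joint, position):
    """Marginal distribution of the token at the given position."""
    out = defaultdict(float)
    for seq, p in joint.items():
        out[seq[position]] += p
    return dict(out)

def forward_log_prob(joint, seq):
    """log p(x1) + log p(x2 | x1): predict the next token."""
    first = marginal(joint, 0)
    return math.log(first[seq[0]]) + math.log(joint[seq] / first[seq[0]])

def backward_log_prob(joint, seq):
    """log p(x2) + log p(x1 | x2): predict the previous token."""
    last = marginal(joint, 1)
    return math.log(last[seq[1]]) + math.log(joint[seq] / last[seq[1]])

def expected_loss(joint, log_prob):
    """Expected negative log-likelihood under the true distribution."""
    return -sum(p * log_prob(joint, seq) for seq, p in joint.items())

forward_loss = expected_loss(JOINT, forward_log_prob)
backward_loss = expected_loss(JOINT, backward_log_prob)
print(forward_loss, backward_loss)
```

Both numbers match up to floating-point rounding, since each factorization recovers the same joint probability. So if a trained backward model ends up with a different loss than a trained forward model, that has to come from learnability or the structure of the data, not from the direction of the objective alone.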
Is there a link with entropy creation?