by blackbear_ 1487 days ago
> On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Train on test, improved performance on test. Wow.

2 comments

> Wow.

Transformers are limited by the size of their attention window: they can take a few thousand tokens at most. Your data might not fit into that window, and you may not want to fine-tune the model either. This paper offers a solution.

It isn't being trained on the test set. The whole point of memory is that you can change its contents at will, so the model can use information it has never seen before without being trained on it.
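
To make the distinction concrete, here is a toy sketch of what "memory you can change at will" could look like. This is not the paper's implementation; the `ExternalMemory` class, its method names, and the 2-d example vectors are all made up for illustration, and I'm assuming a plain dot-product nearest-neighbour lookup. The thing to notice is that nothing here is a trainable weight: new definitions are appended at test time and retrieved by similarity.

```python
class ExternalMemory:
    """Hypothetical key-value store sitting outside the model's weights."""

    def __init__(self):
        self.keys = []    # one vector per stored item
        self.values = []  # the payload stored alongside each key

    def add(self, key, value):
        # Writing to memory is a plain append: no gradient step, no training.
        self.keys.append(list(key))
        self.values.append(value)

    def lookup(self, query, k=1):
        # Score each stored key by dot product with the query (an assumption;
        # other similarity measures would work too), then return the top k values.
        def score(key):
            return sum(q * x for q, x in zip(query, key))

        ranked = sorted(zip(self.keys, self.values), key=lambda kv: -score(kv[0]))
        return [value for _, value in ranked[:k]]


# At test time, store two newly defined items the model never saw in training.
memory = ExternalMemory()
memory.add([1.0, 0.0], "def square(x): return x * x")
memory.add([0.0, 1.0], "lemma: a + 0 = a")

# A query close to the first key retrieves the function definition.
result = memory.lookup([0.9, 0.1], k=1)
```

Swap out the contents of `memory` and the same frozen model sees different context, which is why this isn't the same thing as training on test.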