| HN Mirror

An LSTM takes a series of values and uses a combination of gates to determine critical information to hold on to or forget as a sequence unfolds. This is a compressive technique that removes the requirement of having all previous sequence information at the time of a particular inference.

This paper "compress sequence information into an anchor token" which is then used at inference time to reduce the information required for prediction as well as speed up that prediction. They do this via "continually pre-training the model to compress sequence information into the anchor token."