It's a sparse attention scheme. They store and reuse activations, thus "memorising" the past without the need for training. To keep the sequence short enough to fit into memory, they recall only the k most similar memories from a much larger log.
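
To make the recall step concrete, here is a rough sketch of top-k retrieval over a log of stored activations. It is not the authors' implementation: the key/value split of each memory, dot-product similarity, the softmax weighting, and the name `recall_top_k` are all assumptions for illustration, and it uses plain Python lists so it runs without dependencies.

```python
import heapq
import math


def recall_top_k(query, log, k):
    """Attend over only the k logged memories most similar to `query`.

    `log` is a non-empty list of (key, value) pairs of float lists, and
    k must be at least 1. Dot-product similarity and softmax weighting
    are assumptions of this sketch, not details taken from the source.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Score every logged memory against the query.
    scored = [(dot(query, key), i) for i, (key, _) in enumerate(log)]

    # Keep only the k best matches (returned in descending score order).
    top = heapq.nlargest(k, scored)

    # Softmax over the retained scores, shifted by the max for stability.
    peak = top[0][0]
    weights = [math.exp(score - peak) for score, _ in top]
    total = sum(weights)

    # Weighted sum of the recalled values.
    dim = len(log[top[0][1]][1])
    out = [0.0] * dim
    for w, (_, i) in zip(weights, top):
        value = log[i][1]
        for d in range(dim):
            out[d] += (w / total) * value[d]

    return out, [i for _, i in top]


# Tiny usage example: three logged memories, recall the two closest.
log = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([-1.0, 0.0], [5.0, 5.0]),
]
output, recalled = recall_top_k([10.0, 0.0], log, k=2)
```

The point of the sketch is that the attention cost depends on k rather than on the length of the log. The linear scan used for scoring here would presumably be replaced by an approximate nearest-neighbour index once the log gets large.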